technical documentation

compared with
Current by peri stracchino
on Jun 15, 2011 13:16.

pHash supports comparison of JPEG and BMP formatted images only, whereas our images were in TIFF format, so a prior step had to be to convert the images to JPEG (we used GIMP). We discovered that the choice of algorithm is critical. The default radial hashing algorithm was unable to identify similar images with different rotations correctly. However, the DCT algorithm was able to identify similar images with up to at least 5% rotation consistently - the greatest required for our task was around 2% rotation, so we felt this was satisfactory. The Marr/Mexican hat algorithm crashed the site every time we tried it, so we can't comment on this\! The output gives a decision as to whether or not the images are similar, the threshold used, and the Hamming distance (the *Hamming* *distance* between two strings of equal length is the number of positions at which the corresponding symbols are different). The threshold is not settable on the demo page, but would be definable in an actual application developed around pHash. pHash can be run against audio, video and image files.
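As an illustration of the Hamming distance and threshold decision described above, here is a minimal Python sketch. It does not call pHash itself: it assumes the two hashes are already available as integers (or as equal-length strings), and the function names and the threshold value are hypothetical, chosen only for the example.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions at which two integer hashes differ."""
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(hash_a ^ hash_b).count("1")


def hamming_distance_str(a: str, b: str) -> int:
    """Hamming distance between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must be of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def is_similar(hash_a: int, hash_b: int, threshold: int) -> bool:
    """Decide similarity; the threshold is an assumption to be tuned."""
    return hamming_distance(hash_a, hash_b) <= threshold


if __name__ == "__main__":
    print(hamming_distance(0b1011, 0b1001))
    print(hamming_distance_str("karolin", "kathrin"))
    # threshold=10 is an arbitrary illustrative value, not a pHash default.
    print(is_similar(0b1011, 0b1001, threshold=10))
```

A suitable threshold would have to be found by experiment on known rescans from the collection.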

We therefore felt that a viable solution could be built around pHash. Developing it would require a program (C++, VB and Python are possible languages) built around the pHash URL, which would then be run against each possible pair of images. The images are organised into folders of approximately 30 images, and rescans are only likely to be found within the same folder, so each image in a folder would have to be tested for similarity against every other image in that folder. As there are over 100,000 images in the collection this is a substantial task, and it may be best to copy the file collection onto a separate server and run the task there, to avoid eating up processing power on the main server.
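To give a feel for the size of the task, the per-folder pairing can be sketched in Python as follows. This is only an outline under stated assumptions: the `distance` argument is a hypothetical callable standing in for whatever pHash comparison is eventually used, and the extension list and function names are illustrative. A folder of 30 images gives 30 × 29 / 2 = 435 pairs, so with roughly 100,000 images in folders of about 30 the total is on the order of 1.4 to 1.5 million comparisons, far fewer than comparing every image against every other image in the collection.

```python
import itertools
import os
from typing import Callable, Iterator, List, Tuple

# Formats pHash can compare; TIFF files must be converted beforehand.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".bmp")


def pair_count(n: int) -> int:
    """Number of unordered pairs among n images."""
    return n * (n - 1) // 2


def image_pairs_in_folder(folder: str) -> Iterator[Tuple[str, str]]:
    """Yield every unordered pair of image paths directly inside folder."""
    names = sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    paths = [os.path.join(folder, name) for name in names]
    return itertools.combinations(paths, 2)


def find_candidate_rescans(
    root: str,
    distance: Callable[[str, str], int],  # hypothetical pHash wrapper
    threshold: int,
) -> List[Tuple[str, str, int]]:
    """Compare images only within their own folder, never across folders."""
    matches = []
    for folder, _subdirs, _files in os.walk(root):
        for path_a, path_b in image_pairs_in_folder(folder):
            d = distance(path_a, path_b)
            if d <= threshold:
                matches.append((path_a, path_b, d))
    return matches


if __name__ == "__main__":
    print(pair_count(30))
```

Passing the comparison in as a callable keeps the folder-walking logic independent of whether the hashes come from the C++ library, a web call, or another binding.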

NOTE: I have since found a Ruby implementation on the web at [http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/|http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/] which might be worth investigating.
\\
\\