The technical challenge here was to identify images of the same subject that are similar but not identical. The image set is a collection of scanned pages, some of which have been rescanned at slightly different rotations (up to 2%). The curator (Gareth) would like to identify and rationalise these, but at the moment the only way to do so is to eyeball the individual images - a horrendous task. Most are text based. From a curatorial point of view these are duplicates, but from a technical point of view they are different images of the same subject rather than exact duplicates. Identifying 'duplicates' is therefore not straightforward, so we focussed on investigating and testing technologies that could provide the basis for a future solution.
The first technology we tried was perceptualDiff, which looked promising because it lets the user define a threshold number of differing pixels that can be ignored when deciding whether two slightly dissimilar images should be considered effectively the same. However, it was not a viable solution for this particular problem because it cannot compare pairs of images with differing dimensions, and the images in our dataset were not all the same size, not even rescanned versions of the same page.
The official download is available at http://sourceforge.net/projects/pdiff
If you use this you also need to download the FreeImage DLL.
There is also a downloadable version with dependencies included at http://www.tilander.org/aurora2/Comparing_Images/index.html
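To illustrate the thresholding idea and the dimension problem described above, here is a minimal Python sketch. It is not perceptualDiff's actual perceptual metric; it simply counts differing pixels in two greyscale images held as lists of rows. The function names and the `tolerance` parameter are our own, for illustration only.

```python
def count_differing_pixels(img_a, img_b, tolerance=0):
    """Count pixels whose grey values differ by more than `tolerance`.

    Images are lists of rows of integers (hypothetical in-memory format).
    """
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        # The same limitation we hit: no pixel-by-pixel comparison is
        # possible when the two images have different dimensions.
        raise ValueError("images must have identical dimensions")
    return sum(
        1
        for row_a, row_b in zip(img_a, img_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if abs(pixel_a - pixel_b) > tolerance
    )


def effectively_same(img_a, img_b, max_differing_pixels):
    """Treat two images as the same if few enough pixels differ."""
    return count_differing_pixels(img_a, img_b) <= max_differing_pixels
```

Any approach of this kind fails as soon as a rescan comes out a few pixels wider or taller than the original, which is what ruled it out for our image set.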
So we thought OCR might also provide a solution. If we could compare the OCR outputs of the various text based images we could establish which images were essentially the same as each other. Sven and George offered the use of their Tesseract OCR service for test purposes. They also suggested a script they have developed which compares Levenshtein distances between texts (the number of edits required to make the two texts identical) in order to decide whether two texts are essentially the same. Initial tests with a couple of sample files from outside the collection worked as expected, but when we tried the procedure on our actual image set Tesseract was unable to handle the images well, and the output was not usable. This MAY be a problem with the particular Tesseract service rather than Tesseract itself, although documentation on the web does suggest that Tesseract works best on black and white text and does not handle grayscale or colour text images well, and our images were in fact grayscale.
In addition, it emerged while looking through the collection for suitable pairs of test images that not all the images were text based, so OCR would not be a suitable means of testing these for rescans.
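For reference, the Levenshtein comparison mentioned above can be sketched in a few lines of Python. This is not Sven and George's script, which we have not reproduced here; it is a standard dynamic-programming version, and the `texts_match` helper with its `max_ratio` cut-off is a hypothetical way of turning the distance into a yes/no decision.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def texts_match(a, b, max_ratio=0.1):
    """Hypothetical decision rule: the edit distance must be a small
    fraction of the longer text's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= max_ratio
```

Scaling the distance by text length matters because OCR noise grows with page length, so a fixed number of allowed edits would be too strict for long pages and too lax for short ones. The right cut-off would have to be found by testing on real OCR output.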
So we took a look at pHash (www.phash.org).
pHash looked very promising. pHash stands for perceptual hashing, and aims to provide robust image comparison which can allow for various transformations or "attacks", including rotations. pHash is not a fully developed application; it is a C++ library which builds as a DLL file. However, the pHash site includes a demo page which allows images to be submitted for comparison with various hash algorithms, so we were able to use this to test various sample image pairs. As we could only test on a pair by pair basis, the number of images we could test was limited - we tried to test a representative sample but no real bulk testing has been done.
pHash supports comparison of JPEG and BMP formatted images only, whereas our images were in TIFF format, so a prior step had to be to convert the images to JPEG (we used GIMP). We discovered that the algorithm selected is critical. The default radial hashing algorithm was unable to identify similar images with different rotations correctly. However, the DCT algorithm was able to identify similar images with up to at least 5% rotation consistently - the greatest required for our task was around 2% rotation, so we felt this was satisfactory. The Marr/Mexican hat algorithm crashed the site every time we tried it, so we can't comment on this! The output gives a decision as to whether or not the images are similar, the threshold used, and the Hamming distance (the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different). The threshold is not settable on the demo page, but would be definable in an actual application developed around pHash. pHash can be run against audio, video and image files.
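The Hamming distance decision described above is simple to sketch in Python. The example assumes each hash is held as a non-negative integer (as far as we know the pHash DCT image hash is 64 bits long); the function names are our own, and the threshold is left as a parameter because we do not know what value the demo page uses.

```python
def hamming_distance(hash_a, hash_b):
    """Number of bit positions at which two integer hashes differ."""
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(hash_a ^ hash_b).count("1")


def is_similar(hash_a, hash_b, threshold):
    """Hypothetical decision rule: similar if few enough bits differ."""
    return hamming_distance(hash_a, hash_b) <= threshold
```

For example, `hamming_distance(0b1011, 0b1001)` is 1 because only one bit differs. A suitable threshold for our collection would need to be chosen by testing known rescans against known non-matching pages.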
So we felt a viable solution could be built around this. Developing it would require a program (C++, VB and Python are possible languages) to be built around pHash, which would then need to be run against each possible pair of images. The images are organised into folders of approx 30 images, and rescans are only likely to be found within these folders, so each image within a folder would have to be tested for similarity against every other image in the same folder. As there are over 100,000 images in the collection this is a substantial task, and it is possible that it would be best to copy the file collection onto a separate server and run the task there, in order to avoid eating up processing power on the main server.
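As a rough sketch of the folder-by-folder pairing described above, the following Python generates every pair of images that share a folder and passes each pair to a comparison function. The comparison itself is deliberately left as a parameter (`images_match` is a hypothetical callable, for instance one wrapping a pHash DCT hash and a Hamming distance threshold); nothing here calls pHash directly.

```python
from collections import defaultdict
from itertools import combinations
from pathlib import PurePath


def pairs_in_folder(n_images):
    """Number of unordered pairs among n_images files: n * (n - 1) / 2."""
    return n_images * (n_images - 1) // 2


def candidate_pairs(image_paths):
    """Yield every pair of image paths that share a parent folder."""
    by_folder = defaultdict(list)
    for path in image_paths:
        by_folder[PurePath(path).parent].append(path)
    for folder in sorted(by_folder):
        # combinations() yields each unordered pair exactly once.
        yield from combinations(sorted(by_folder[folder]), 2)


def find_rescans(image_paths, images_match):
    """Return the pairs for which the supplied comparison returns True."""
    return [
        (a, b) for a, b in candidate_pairs(image_paths) if images_match(a, b)
    ]
```

A folder of 30 images gives `pairs_in_folder(30)`, that is 435 comparisons. With over 100,000 images in folders of roughly that size, the whole collection would need something on the order of 1.5 million comparisons, far fewer than comparing every image against every other image in the collection. Computing each image's hash once and reusing it for all of its pairs would reduce the cost further.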
NOTE: I have since found a Ruby implementation on the web at http://www.mikeperham.com/2010/05/21/detecting-duplicate-images-with-phashion/ which might be worth investigating.
pHash home page: http://www.phash.org