Corrupted JPEG and JPEG2000 files solution

Title Corrupted JPEG and JPEG2000 files solution
Detailed description I first tried edge detection, but pages full of text and line drawings contain so many edges that the edges of the corrupted areas were no more visible than before.
I then converted the JPEGs to smaller 1-bit PNGs, so that processing them would be quicker.

I wrote a Python 2 script to find areas of black. The program first looks for contiguous rows that have a higher than average number of black pixels. Within these rows, it then looks for contiguous columns that are largely black. It reports the files that contain such areas and also produces mask image files showing where the black areas were found.
It is run as so:

The mask files will be put into small_newspapers, and the names of any images with black areas will be put into results.txt.
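
The row-then-column search described above could be sketched roughly as follows. This is not the original script: the function names, the `row_factor`, `col_fraction`, `min_rows` and `min_cols` parameters and their default values are all assumptions made for illustration, and the image is assumed to be already loaded as a list of rows in which 1 means black and 0 means white.

```python
def contiguous_runs(flags, min_length):
    """Return (start, end) pairs, end exclusive, for runs of True
    that are at least min_length long."""
    runs = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    if start is not None and len(flags) - start >= min_length:
        runs.append((start, len(flags)))
    return runs


def find_black_areas(pixels, row_factor=1.0, col_fraction=0.9,
                     min_rows=3, min_cols=3):
    """Return bounding boxes (top, left, bottom, right), ends exclusive.

    pixels is a list of rows, 1 = black, 0 = white. The keyword
    arguments are hypothetical weightings, not the script's real ones.
    """
    if not pixels or not pixels[0]:
        return []
    width = len(pixels[0])

    # Step 1: rows with more black pixels than the (weighted) average.
    row_counts = [sum(row) for row in pixels]
    average = sum(row_counts) / float(len(row_counts))
    dark_rows = [count > average * row_factor for count in row_counts]

    boxes = []
    for top, bottom in contiguous_runs(dark_rows, min_rows):
        height = bottom - top
        # Step 2: within this band of rows, columns that are largely black.
        col_counts = [sum(pixels[y][x] for y in range(top, bottom))
                      for x in range(width)]
        dark_cols = [count >= col_fraction * height for count in col_counts]
        for left, right in contiguous_runs(dark_cols, min_cols):
            boxes.append((top, left, bottom, right))
    return boxes


# A 12x10 white page with two stray "text" pixels and one black block.
demo = [[0] * 12 for _ in range(10)]
for y in range(3, 7):
    for x in range(2, 8):
        demo[y][x] = 1
demo[0][5] = 1
demo[8][1] = 1
print(find_black_areas(demo))  # [(3, 2, 7, 8)]
```

A file would be reported if the returned list is non-empty, and the boxes could then be painted into a mask image. Raising `row_factor` or `col_fraction` in this sketch makes the search stricter.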

Solution Champion Swithun Crowe
Corresponding Issue(s)
Tool/code link SPRUCE/tree/master/black_pixels
Tool Registry Link
  • Solution champion: I'd hoped to use Fortran 77, but the PBM (ASCII) library didn't work. Python is reasonably fast at processing the small images, but slow on the original images.
  • Converts to a monochrome image (PBM), isolates corruption where there is a large area of black, identifies contiguous rows with a high black pixel count, and creates a bounding box
  • Weightings for identifying corruption areas can be tweaked
  • Issue owner: Excellent! This has been looked at in previous mashups but not solved. Can now take this away and run it over the collection to determine scope of problem, and more thoroughly test accuracy of the solution.
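
As an illustration of the PBM step mentioned in the notes above, the following sketch reads a plain (ASCII, `P1`) PBM image into rows of 0s and 1s and writes a mask image from a list of bounding boxes. In PBM, 1 is black and 0 is white. The function names and the box layout `(top, left, bottom, right)` with exclusive ends are assumptions for this example, not taken from the original tool.

```python
def parse_ascii_pbm(text):
    """Parse plain PBM (P1) text into (width, height, rows); 1 = black."""
    # Comments run from '#' to the end of the line.
    lines = [line.split('#', 1)[0] for line in text.splitlines()]
    tokens = ' '.join(lines).split()
    if len(tokens) < 3 or tokens[0] != 'P1':
        raise ValueError('not a plain PBM (P1) image')
    width, height = int(tokens[1]), int(tokens[2])
    # Whitespace between pixels is optional in P1, so join all digits.
    bits = ''.join(tokens[3:])
    if len(bits) < width * height:
        raise ValueError('image data is truncated')
    rows = [[int(bits[y * width + x]) for x in range(width)]
            for y in range(height)]
    return width, height, rows


def render_mask_pbm(width, height, boxes):
    """Return plain PBM text in which each box is painted black.

    boxes holds (top, left, bottom, right) tuples with exclusive ends
    (a hypothetical layout chosen for this sketch).
    """
    mask = [[0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                mask[y][x] = 1
    lines = ['P1', '%d %d' % (width, height)]
    lines += [' '.join(str(bit) for bit in row) for row in mask]
    return '\n'.join(lines) + '\n'
```

The text returned by `render_mask_pbm` can be written to a `.pbm` file and opened in most image viewers to check where corruption was flagged.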
Labels: spruce_glasgow, spruce, solution, bit_rot_detection