This story is associated with a success story: http://wiki.opf-labs.org/display/SP/QA+and+Characterisation+of+Web+Content
To be confident that we have preserved a website, we need a digital preservation system that can automate the comparison of two web snapshots. The pair could be a harvested copy and a previous harvested copy that has been manually verified as an accurate representation of the site, or a harvested copy and the live version. This will enable us to confirm that web content has been successfully harvested and to inform harvesting policies.
I need a tool to generate an image of a Web page found in a WARC/ARC file so that I can compare this image with the live copy.
I need a tool to compare two web page screenshots and to provide a similarity score so I can assess how closely the live site and the harvested site match visually.
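As a rough illustration of what such a similarity score could look like, the sketch below compares two screenshots pixel by pixel and returns a value between 0 and 1. It is only one possible metric, not the tool this story asks for. The function name `similarity_score` and the representation of a screenshot as rows of 8-bit grayscale values are assumptions made for the example; a real tool would first decode the image files with an imaging library and would probably use a more perceptual measure.

```python
from typing import Sequence

# Assumption: a screenshot is given as rows of 8-bit grayscale values (0..255).
Pixels = Sequence[Sequence[int]]


def similarity_score(a: Pixels, b: Pixels) -> float:
    """Return 1.0 for identical screenshots and 0.0 for maximally different ones.

    The score is 1 minus the mean absolute pixel difference, scaled by 255.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("screenshots must not be empty")
    # Sum of absolute differences over every pair of corresponding pixels.
    diff = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 1.0 - diff / (255 * total)
```

With this metric, two identical grids score 1.0 and an all-black grid compared with an all-white grid scores 0.0. Screenshots of different sizes are rejected rather than resized, which keeps the sketch simple but would need revisiting for pages whose rendered height changes.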
- MUST be able to read WARC files
- SHOULD be able to read ARC files
- MUST continue after any network downtime
- SHOULD be robots.txt aware
- Similarity score MUST be normalised between 0 and 1, where 1 is identical and 0 is no similarity at all
- Comparison of a WARC against the live site MUST complete within the update interval of the live website
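Two of the requirements above, robots.txt awareness and continuing after network downtime, can be sketched with the Python standard library alone. This is an illustrative outline under stated assumptions, not a proposed implementation. The helper names `allowed_by_robots` and `fetch_with_retry`, the `qa-bot` user agent in the usage note, and the exponential back-off policy are all hypothetical. The `fetch` argument stands in for whatever function the tool uses to retrieve the live page.

```python
import time
import urllib.robotparser
from typing import Callable, Iterable


def allowed_by_robots(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on network errors with exponential back-off.

    Assumption: network failures surface as OSError (urllib's URLError and
    ConnectionError are both subclasses of it).
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A caller would typically check `allowed_by_robots(lines, "qa-bot", url)` before requesting the live page, then pass its own download function to `fetch_with_retry`. The `sleep` parameter is injectable so that the back-off can be tested without waiting.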
Create experiments as child pages and they should appear automatically here
Re-use existing experiments over SB Web Archive (NBR)
This is essentially the ARC to WARC migration story without the migration step, so it makes sense to review the QA steps developed as part of that story.
Scenarios, case studies, etc. that provide background to this story.