Status

Active

This story is associated with a success story: http://wiki.opf-labs.org/display/SP/QA+and+Characterisation+of+Web+Content

Contact

Leila Medjkoune

User Story

In order to be confident that we have preserved a website, we need a digital preservation system that can automate the comparison of two Web snapshots. These could be a harvested copy and a previous harvested copy that has been manually verified as an accurate representation of the site, or a harvested copy and the live version of the site. This will enable us to confirm that Web content has been successfully harvested and to inform harvesting policies.

User Requirements/Components

I need a tool to generate an image of a Web page found in a WARC/ARC file so that I can compare this image with the live copy.

I need a tool to compare two web page screenshots and to provide a similarity score so I can assess how closely the live site and the harvested site match visually.

  1. MUST be able to read WARC files
  2. SHOULD be able to read ARC files
  3. MUST continue after any network downtime
  4. SHOULD be robots.txt aware
  5. Similarity score MUST be normalised between 0 and 1, where 1 is identical and 0 is no similarity at all
  6. WARC comparison MUST complete within the update interval of the live website
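
To illustrate requirement 5, here is a minimal sketch of a similarity score normalised between 0 and 1. This story does not prescribe a comparison metric; the sketch assumes mean absolute pixel difference, and it assumes both screenshots have already been rendered at the same size and converted to 8-bit grayscale. The name `similarity_score` and the `Pixels` type are hypothetical, chosen only for this example.

```python
from typing import Sequence

# Assumption: a screenshot is given as rows of 8-bit grayscale values (0-255).
Pixels = Sequence[Sequence[int]]


def similarity_score(a: Pixels, b: Pixels) -> float:
    """Return 1.0 for identical images and 0.0 for maximally different ones.

    The metric (mean absolute pixel difference) is an illustrative choice,
    not the metric used by any particular tool.
    """
    # Both screenshots must have the same number of rows and columns.
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("screenshots must have the same dimensions")

    pixel_count = sum(len(row) for row in a)
    if pixel_count == 0:
        raise ValueError("screenshots must not be empty")

    # Sum of absolute differences over every pair of corresponding pixels.
    total_diff = sum(
        abs(pa - pb)
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
    )

    # Dividing by the largest possible difference keeps the score in [0, 1].
    return 1.0 - total_diff / (255 * pixel_count)


if __name__ == "__main__":
    harvested = [[0, 0], [0, 0]]
    live = [[255, 0], [0, 0]]
    print(similarity_score(harvested, live))
```

A pixel-level metric like this one is sensitive to small layout shifts, so a real comparison tool would probably need something more tolerant. The sketch only shows how the 0 to 1 normalisation can be obtained.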

Experiments

Create experiments as child pages and they should appear automatically here

Re-use existing experiments over SB Web Archive (NBR)

Developer Notes

This is essentially the ARC to WARC migration story without the migration step, so it makes sense to reuse the QA steps developed as part of that story.

Related Documents

Scenarios, case studies, etc. that provide background to this story.
