Title
Incompleteness and/or inconsistency of web archive data
Detailed description The best practice for preserving websites is to crawl them using a web crawler such as Heritrix. However, crawling is a process that is highly susceptible to errors. Often, essential data is missed by the crawler and is therefore neither captured nor preserved. If the aim is to create a high-quality web archive, quality assurance is essential.
Currently, quality assurance requires manual effort and is expensive. Since crawls often contain thousands of pages, manual quality assurance is neither very efficient nor effective. It might make sense for "topic" crawls but remains time-consuming and costly. Especially for large-scale crawls, automation of the quality control process is a necessity.
Scalability Challenge
QA on large crawls, on-the-fly checking, and regular checks of the quality of the whole web archive to detect access issues
Issue champion Leïla Medjkoune (IM), René Voorburg (KB)
Other interested parties

Possible Solution approaches
  • ALL
    • Some efforts are under way to use a setup with sets of standard web browsers running 'headless' together with a Wayback Machine in proxy mode. Using this method, links that are missed by the crawler but are requested by the browser are recorded in the Wayback logs and can thus be added to the harvest (see the log-parsing sketch after this list).
    • The headless browser approach will help to detect and solve a certain set of problems, specifically links that were missed due to the highly interactive nature of the pages involved or due to robot-evasive measures. The approach does not help to answer QA questions such as: is the harvested site still active, how much has the content changed (should we lower or raise the harvesting frequency), has part of the content moved to a different domain, etc.
    • The approach suggested to tackle these potential quality issues is to create reference images for each crawled site. A reference image is either a snapshot that has undergone the usual (still labour-intensive) manual quality assurance process, or a screenshot of the live site taken while crawling, rendered in a number of different browsers. Using this reference image, each new crawl can be compared by automated means. This can be done using various metrics as indicators, such as the change in the size of the crawl and the number of changed pages, but also more advanced methods such as automated visual comparison of the percentage and location of changes in the rendered pages.
    • With these metrics as indicators of the change in the new crawl compared to the reference crawl, various actions can be undertaken: if the changes are very minor, the quality of the crawl is good and a lower crawl frequency can be set; if the changes are relatively small, the quality of the crawl is good; if the changes are relatively big, manual inspection is required (a minimal decision sketch follows this list). Once a crawl is approved, it becomes the new reference crawl. We could also envisage such automated comparison triggering automated actions that improve crawl completeness and quality and allow monitoring of the web archives from a qualitative and long-term preservation perspective. Overall, this approach is likely to improve the effectiveness, efficiency and scalability of quality assurance for specific crawls and for web archives in general. It is strongly related to PW-12: Automated Watch as outlined in WCT7.
    • Another approach would consist in taking screenshots of the live web pages as the crawl is running and comparing these to screenshots of the same pages rendered from the archive (through a Wayback-like tool). This would allow pointing out any major capture or access issue and would limit manual QA intervention (see the screenshot-comparison sketch after this list).
  • EXL
    • We agree this is a difficult problem. Even if we were able to take a screenshot of the crawled website and perform a reliable image comparison of the two captures, it would be difficult to differentiate between changes that are errors (incomplete captures, etc.) and genuine changes in content.
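The following is a minimal sketch of the log-parsing idea from the headless-browser approach above. It assumes a proxy log where each miss ends with a 404 status followed by the requested URL; the real Wayback proxy log layout may differ, so the pattern and file name are assumptions to adapt.

```python
import re
from urllib.parse import urlparse

# Assumed log format (not the actual Wayback log layout):
# each miss line ends with "404 <url>", e.g. "... 404 http://example.org/missing.css"
MISS_PATTERN = re.compile(r"\s404\s+(https?://\S+)\s*$")

def extract_missed_urls(log_path, allowed_hosts=None):
    """Collect URLs the browser requested but the proxy could not serve."""
    missed = set()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            match = MISS_PATTERN.search(line)
            if not match:
                continue
            url = match.group(1)
            host = urlparse(url).hostname or ""
            # Optionally restrict to hosts that are in scope for the crawl.
            if allowed_hosts and host not in allowed_hosts:
                continue
            missed.add(url)
    return sorted(missed)

if __name__ == "__main__":
    for url in extract_missed_urls("wayback-proxy.log"):
        print(url)  # feed these back to the crawler as additional seeds
```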
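The decision logic described for comparing a new crawl against the reference crawl could look like the sketch below. The `CrawlSummary` fields and the 5% / 30% thresholds are illustrative assumptions, not values mandated by the issue.

```python
from dataclasses import dataclass

@dataclass
class CrawlSummary:
    """Minimal per-crawl statistics; the field names are illustrative only."""
    total_bytes: int
    page_hashes: dict  # URL -> content hash of the stored page

def change_ratio(reference: CrawlSummary, new: CrawlSummary) -> float:
    """Fraction of pages that are new, removed, or changed versus the reference."""
    all_urls = set(reference.page_hashes) | set(new.page_hashes)
    if not all_urls:
        return 0.0
    changed = sum(
        1 for url in all_urls
        if reference.page_hashes.get(url) != new.page_hashes.get(url)
    )
    return changed / len(all_urls)

def qa_decision(reference: CrawlSummary, new: CrawlSummary,
                minor=0.05, major=0.30) -> str:
    """Map the change ratio onto the actions described above (thresholds are assumed)."""
    ratio = change_ratio(reference, new)
    if ratio <= minor:
        return "accept; consider lowering the crawl frequency"
    if ratio <= major:
        return "accept; crawl quality looks good"
    return "flag for manual inspection; if approved, promote to new reference crawl"
```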
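For the screenshot-comparison approach, a simple automated visual diff can report both the overall percentage of change and where on the page the change is concentrated. This sketch uses Pillow and NumPy; the function name, pixel threshold and grid size are assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image

def visual_diff(live_png, archived_png, threshold=30, grid=4):
    """Compare two screenshots: return the overall change percentage and a
    coarse grid of per-region change percentages (locates the changes)."""
    live = np.asarray(Image.open(live_png).convert("L"), dtype=np.int16)
    arch = np.asarray(Image.open(archived_png).convert("L"), dtype=np.int16)

    # Crop both to the common area so differing page heights do not skew the result.
    h = min(live.shape[0], arch.shape[0])
    w = min(live.shape[1], arch.shape[1])
    diff = np.abs(live[:h, :w] - arch[:h, :w]) > threshold

    changed_pct = 100.0 * diff.mean()

    # Per-cell change percentages for a grid x grid overlay.
    cells = np.zeros((grid, grid))
    for row in range(grid):
        for col in range(grid):
            cell = diff[row * h // grid:(row + 1) * h // grid,
                        col * w // grid:(col + 1) * w // grid]
            cells[row, col] = 100.0 * cell.mean() if cell.size else 0.0
    return changed_pct, cells
```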
Context
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets
Solutions SO18 Comparing two web page versions for web archiving

Evaluation

Objectives Which SCAPE objectives do this Issue and a future Solution relate to? e.g. scalability, robustness, reliability, coverage, precision, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to provide in order to evaluate it for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80Gb files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures that you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario
Labels:
webarchive, characterisation, qa, identification, issue, watch