
Dataset:

Title
Internet Memory Web collections
Description The data consists of web content crawled, stored and hosted by the Internet Memory Foundation in (W)ARC format (approx. 300 TB).
Using this content, IM can also call on its task force (QA team) to provide annotated data, such as pairs of annotated snapshots for quality assurance scenarios.
1000 annotated pairs of web pages (similar/dissimilar) were produced as part of PC.WP3: Quality Assurance Components.
Licensing Web collections crawled on behalf of partner institutions will require the institutions' agreement to be used by SCAPE partners
Owner Internet Memory
Dataset Location Provided upon request
Collection expert Leïla Medjkoune (IM)
Issues brainstorm A bulleted list of possible preservation or business-driven Issues. This is useful for describing ideas that might be turned into detailed Issues at a later date
List of Issues A list of links to detailed Issue pages relevant to this Dataset


Issue:

Title
Incompleteness and/or inconsistency of web archive data
Detailed description The best practice for preserving websites is to crawl them using a web crawler such as Heritrix. However, crawling is a process that is highly susceptible to errors: essential data is often missed by the crawler and thus not captured and preserved. If the aim is to create a high-quality web archive, quality assurance is therefore essential.
Currently, quality assurance requires manual effort and is expensive. Since crawls often contain thousands of pages, manual quality assurance is neither very efficient nor effective. It may make sense for “topic” crawls, but it remains time-consuming and costly. Especially for large-scale crawls, automation of the quality control process is a necessary requirement.
Scalability Challenge
QA on large crawls, on-the-fly checking, and regular checks of the quality of the whole web archive to detect access issues
Issue champion Leïla Medjkoune (IM), René Voorburg (KB)
Other interested parties

Possible Solution approaches
  • ALL
    • Some efforts are under way to use a setup with sets of standard web browsers running 'headless' and a Wayback Machine in proxy mode. Using this method, links that are missed by the crawler but are detected by the browser can be recorded via the Wayback logs and thus added to the harvest (a minimal sketch of this log-driven patching step is given below, after this list).
    • The headless browser approach will help to detect and solve a certain set of problems, specifically links that were missed due to the highly interactive nature of the pages involved or due to robot-evasive measures. The approach does not help to answer QA questions such as: is the harvested site still active, how much has the content changed (should we lower or raise the harvesting frequency), has part of the content moved to a different domain, etc.
    • The approach suggested to tackle these potential quality issues is to create a reference image for each crawled site. A reference image is a snapshot that has undergone the usual (still labor-intensive) manual quality assurance process, or a screenshot of the live site taken in a number of different browsers while crawling. Using this reference image, each new crawl can be compared by automated means. This can be done using various metrics as indicators, such as the change in the size of the crawl and the number of changed pages, but also more advanced methods such as automated visual comparison of the percentage and location of changes in the rendered pages.
    • With these metrics as indicators of the change in the new crawl compared to the reference crawl, various actions can be undertaken: if the changes are very minor, the quality of the crawl is good and a lower crawl frequency can be set; if the changes are relatively small, the quality of the crawl is good; if the changes are relatively big, manual inspection is required. Once a crawl is approved, it becomes the new reference crawl. We could also envisage that such automated comparison could trigger automated actions that would improve crawl completeness/quality and allow monitoring of the web archives from a qualitative and long-term preservation perspective. Overall, this approach is likely to improve the effectiveness, efficiency and scalability of quality assurance for specific crawls and for web archives in general. It is strongly related to PW-12: Automated Watch as outlined in WCT7. (A sketch of such a threshold-based comparison and decision step is given below, after this list.)
    • Another approach would consist of taking screenshots of the live web pages as the crawl is running and comparing these to screenshots of the same web pages rendered from the archive (through a Wayback-like tool). This would allow any major capture or access issue to be identified and would limit manual QA intervention.
  • EXL
    • We agree this is a difficult problem. Even if we were able to take a screenshot of the crawled website and perform a reliable image comparison of the two captures, it would be difficult to differentiate between changes that are errors (incomplete captures, etc.) and genuine changes in content.
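The log-driven patching step mentioned above could, under simple assumptions, look like the following sketch: it scans a Wayback access log for requests made by the headless browsers that the archive could not serve (HTTP 404) and turns them into a seed list for a patch crawl. The log format, file names and status-code handling are assumptions; a real proxy-mode Wayback deployment may log requests differently.

```python
import re
from pathlib import Path

# Assumed log format: each line contains the requested URL and an HTTP status code.
# Real Wayback/proxy logs may differ; adjust the patterns accordingly.
STATUS_RE = re.compile(r'\s(\d{3})\s')
URL_RE = re.compile(r'(https?://[^\s"]+)')

def missing_urls(log_path: str) -> set[str]:
    """Collect URLs that the browsers requested but the archive could not serve (404)."""
    missed = set()
    for line in Path(log_path).read_text(errors="ignore").splitlines():
        status, url = STATUS_RE.search(line), URL_RE.search(line)
        if status and url and status.group(1) == "404":
            missed.add(url.group(1))
    return missed

if __name__ == "__main__":
    # Hypothetical file names: the proxy-mode Wayback access log and a crawler seed list.
    seeds = missing_urls("wayback-access.log")
    Path("patch-crawl-seeds.txt").write_text("\n".join(sorted(seeds)))
    print(f"{len(seeds)} missed URLs queued for a patch crawl")
```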
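Similarly, the reference-image comparison and the threshold-based decisions described above might, as a rough illustration, be sketched as follows: the percentage of changed pixels between the approved reference screenshot and a new snapshot is computed with Pillow and mapped to one of the three actions. The thresholds, tolerance value and file names are illustrative assumptions, not values from the project.

```python
from PIL import Image, ImageChops

def changed_pixel_ratio(reference_path: str, snapshot_path: str) -> float:
    """Fraction of pixels that differ between the reference image and the new snapshot."""
    ref = Image.open(reference_path).convert("RGB")
    new = Image.open(snapshot_path).convert("RGB").resize(ref.size)
    diff = ImageChops.difference(ref, new).convert("L")
    changed = sum(1 for px in diff.getdata() if px > 10)  # small tolerance for rendering noise
    return changed / (ref.size[0] * ref.size[1])

def qa_decision(ratio: float) -> str:
    """Map the change ratio to one of the actions described above (illustrative thresholds)."""
    if ratio < 0.01:
        return "quality OK - consider lowering the crawl frequency"
    if ratio < 0.20:
        return "quality OK"
    return "manual inspection required"

if __name__ == "__main__":
    ratio = changed_pixel_ratio("reference.png", "new_crawl.png")
    print(f"{ratio:.1%} of pixels changed: {qa_decision(ratio)}")
```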
Context
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets
Solutions SO18 Comparing two web page versions for web archiving

Evaluation

Objectives Which SCAPE objectives do this Issue and a future Solution relate to? E.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80Gb files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures that you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario

Solutions:

Title Comparing two web page versions for web archiving
Detailed description Our system is based on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
Solution Champion
Sureda-Gutierrez Carlos (UPMC).
Corresponding Issue(s)
IS28 Structural and visual comparisons for web page archiving
IS7 Incompleteness and/or inconsistency of web archive data
IS19 Migrate whole archive to new archiving system
myExperiment Link
MarcAlizer
Tool Registry Link
Pagelyzer
Evaluation
TBD
Labels:
webarchive, scenario
  1. Oct 11, 2012

    I agree with EXL on this. Shouldn't the crawled web pages be compared to the snapshot taken at the time of harvest instead of comparing with the web sites already in the archive?

    Newspaper pages, for example, change completely every minute; how will one be able to detect that there is no problem with the harvest?

  2. Oct 23, 2012

    The evaluation section is not filled in yet.

  3. Dec 20, 2012

    Can someone provide success criteria for this scenario?!