Dataset:
Title |
Internet Memory Web collections |
Description | The data consists of web content crawled, stored and hosted by the Internet Memory Foundation in (W)ARC format (approx. 300 TB). Using this content, IM can also use its taskforce (QA team) to provide annotated data, such as pairs of annotated snapshots for quality assurance scenarios. 1000 annotated pairs of web pages (similar/dissimilar) were produced as part of PC.WP3: Quality Assurance Components. |
Licensing | Web collections crawled on behalf of partner institutions require the institutions' agreement before SCAPE partners can use them |
Owner | Internet Memory |
Dataset Location | Provided upon request |
Collection expert | Leïla Medjkoune |
Issues brainstorm | A bulleted list of possible preservation or business driven Issues. This is useful for describing ideas that might be turned into detailed Issues at a later date |
List of Issues | A list of links to detailed Issue pages relevant to this Dataset |
Issue:
Title |
Incompleteness and/or inconsistency of web archive data |
Detailed description | The best practice for preserving websites is to crawl them with a web crawler such as Heritrix. However, crawling is highly susceptible to errors. Often, essential data is missed by the crawler and thus not captured and preserved. So if the aim is to create a high-quality web archive, quality assurance is essential. Currently, quality assurance requires manual effort and is expensive. Since crawls often contain thousands of pages, manual quality assurance is neither efficient nor effective. It might make sense for “topic” crawls, but it remains time-consuming and costly. For large-scale crawls in particular, automating the quality control processes is necessary. |
Scalability Challenge |
QA on large crawls, on-the-fly checking, and regular checks of the quality of the whole web archive to detect access issues |
Issue champion | Leïla Medjkoune |
Other interested parties |
|
Possible Solution approaches |
|
Context | |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices) |
Training Needs | Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP. |
Datasets | |
Solutions | SO18 Comparing two web page versions for web archiving |
Evaluation
Objectives | Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation |
Success criteria | Describe the success criteria for solving this issue - what are you able to do? - what does the world look like? |
Automatic measures | What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important? If possible specify very specific measures and your goal - e.g. * process 50 documents per second * handle 80Gb files without crashing * identify 99.5% of the content correctly |
Manual assessment | Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution of this issue? If possible specify measures and your goal - e.g. * Solution installable with basic Linux system administration skills * User interface understandable by non-developer curators |
Actual evaluations | Links to actual evaluations of this Issue/Scenario |
Solutions:
Title | Comparing two web page versions for web archiving |
Detailed description | Our system is based on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach. |
Solution Champion |
Sureda-Gutierrez Carlos |
Corresponding Issue(s) |
IS28 Structural and visual comparisons for web page archiving; IS7 Incompleteness and/or inconsistency of web archive data; IS19 Migrate whole archive to new archiving system |
myExperiment Link |
MarcAlizer |
Tool Registry Link |
Pagelyzer |
Evaluation |
TBD |
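The classification step in the detailed description above (a Support Vector Machine trained on vectors of similarity scores, then used to label a pair of versions as similar or dissimilar) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Pagelyzer/MarcAlizer implementation: the source does not specify the kernel, library or features, so the sketch uses a plain linear SVM (hinge loss with L2 regularisation) trained by stochastic sub-gradient descent in pure Python. The two-component score vectors `[structural_similarity, visual_similarity]`, the training values, and all function names are hypothetical.

```python
import random


def decision_value(weights, bias, scores):
    """Signed distance-like score: positive means 'similar'."""
    return sum(w * s for w, s in zip(weights, scores)) + bias


def train_linear_svm(vectors, labels, epochs=300, learning_rate=0.05,
                     reg=0.01, seed=0):
    """Train a linear SVM by stochastic sub-gradient descent on the hinge loss.

    vectors: similarity-score vectors, one per pair of page versions
    labels:  +1 for 'similar', -1 for 'dissimilar'
    """
    rng = random.Random(seed)
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = vectors[i], labels[i]
            # The hinge loss is active when the pair is inside the margin.
            violated = y * decision_value(weights, bias, x) < 1.0
            # L2 regularisation shrinks the weights on every step.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if violated:
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def classify(weights, bias, scores):
    """Label a pair of versions from its similarity-score vector."""
    if decision_value(weights, bias, scores) >= 0.0:
        return "similar"
    return "dissimilar"


# Hypothetical annotated pairs: [structural_similarity, visual_similarity].
TRAIN_VECTORS = [
    [0.92, 0.88], [0.85, 0.95], [0.97, 0.90], [0.80, 0.85],  # similar
    [0.10, 0.20], [0.15, 0.05], [0.30, 0.25], [0.20, 0.10],  # dissimilar
]
TRAIN_LABELS = [1, 1, 1, 1, -1, -1, -1, -1]

weights, bias = train_linear_svm(TRAIN_VECTORS, TRAIN_LABELS)
print(classify(weights, bias, [0.9, 0.9]))
print(classify(weights, bias, [0.1, 0.1]))
```

In a real setting the training vectors would come from the annotated pairs of web pages mentioned in the dataset description, with one component per similarity measure retained by the feature selection step.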
3 Comments
Oct 11, 2012
Miguel Ferreira
I agree with EXL on this. Shouldn't the crawled web pages be compared to the snapshot taken at the time of harvest instead of comparing with the web sites already in the archive?
Newspaper pages, for example, change completely every minute. How will one be able to detect that there is no problem with the harvest?
Oct 23, 2012
Miguel Ferreira
The evaluation section is not filled in yet.
Dec 20, 2012
Miguel Ferreira
Can someone provide success criteria for this scenario?!