Structural and visual comparisons for web page archiving
Detailed description In the context of Web archiving, we propose a framework that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques, in order to detect whether successive versions of a Web page are similar.
Scalability Challenge
The method can be used as a crawler
Issue champion Sureda-Gutierrez Carlos (UPMC).
Other interested parties
Possible Solution approaches A combination of web page structure and computer vision techniques for detecting significant changes between two web page versions
Lessons Learned
Training Needs
Datasets TBC
Solutions SO18 Comparing two web page versions for web archiving
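
The approach named above (page structure combined with a visual comparison) can be illustrated with a minimal sketch. It is not the SO18 implementation: this page does not say how the two signals are computed or combined, so the tag-sequence ratio, the pixel-difference score, the weighted average, and the names `structural_similarity`, `visual_similarity`, `pages_are_similar`, `weight` and `threshold` are all assumptions made for illustration. Rendered snapshots are assumed to be already available as equal-sized grids of grayscale values (0-255); producing them is out of scope here.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structural_similarity(html_a, html_b):
    """Ratio in [0, 1] comparing the tag sequences of two HTML sources."""
    sequences = []
    for source in (html_a, html_b):
        collector = TagCollector()
        collector.feed(source)
        collector.close()
        sequences.append(collector.tags)
    return SequenceMatcher(None, sequences[0], sequences[1]).ratio()


def visual_similarity(pixels_a, pixels_b):
    """1 minus the normalised mean absolute difference of two grayscale grids."""
    flat_a = [value for row in pixels_a for value in row]
    flat_b = [value for row in pixels_b for value in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("snapshots must be non-empty and of equal size")
    total = sum(abs(a - b) for a, b in zip(flat_a, flat_b))
    return 1.0 - total / (255.0 * len(flat_a))


def pages_are_similar(html_a, html_b, pixels_a, pixels_b,
                      weight=0.5, threshold=0.9):
    """Hypothetical combination: weighted average of both scores vs. a threshold."""
    score = (weight * structural_similarity(html_a, html_b)
             + (1.0 - weight) * visual_similarity(pixels_a, pixels_b))
    return score >= threshold, score
```

A call such as `pages_are_similar(old_html, new_html, old_pixels, new_pixels)` returns a boolean verdict together with the combined score. The default weight and threshold are placeholders and would need to be tuned against whatever benchmark is chosen for this issue.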


Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to provide, so that it can be evaluated for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80 GB files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures that you would like to get, do you foresee any manual assessment being necessary to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario
Labels issue, qa, obsolescence
  1. May 14, 2012

    This is not an Issue, it's a Solution! What is the problem, challenge or issue that is being experienced with a specific Dataset? I would suggest moving the text in this page to a relevant Solution page.

    1. May 14, 2012

      Hi Paul, maybe it is a question of rephrasing, but measuring the similarity of two web pages still remains an issue. The solution is the one in SO18, where the actual software to do that is introduced. People at UPMC are now working on the measure of success and benchmarking.

      1. May 14, 2012

        Hi Dennis. Yes, I'm not suggesting that this is not useful work! But this page describes a solution, not a preservation issue. Re-phrasing this to describe what the actual problem is and what the requirements are for the solution, would be useful. Without these details, designing the solution correctly, and evaluating the solution, becomes very difficult.

  2. May 14, 2012

    I must agree with Paul. In the Issue description you should describe a "problem": why is it interesting to do this comparison at all? The description has a bit of this.

    The title could e.g. be "IS28 Need for automatic change detection of web pages" based on the fact that web archives like Internet Memory actually do manual inspection of web pages at the moment - and this does not by nature scale to millions of pages.