Structural and visual comparisons for web page archiving
Detailed description In the context of Web archiving, we propose a framework that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques, in order to detect whether successive versions of a Web page are similar.
Scalability Challenge
The method can be used as a crawler
Issue champion Sureda-Gutierrez Carlos (UPMC).
Possible Solution approaches A combination of web page structure and computer vision techniques for detecting significant changes between two web page versions
Solutions SO18 Comparing two web page versions for web archiving
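
As a rough illustration of the approach described above, the following Python sketch combines a structural score computed from the HTML source with a visual score computed from rendered snapshots. It is not the SO18 software: the function names, the tag-sequence comparison, the pixel-difference measure, the equal weighting, and the 0.9 threshold are all assumptions chosen for illustration. Snapshots are assumed to be available already as equal-size grids of grayscale values (0-255).

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structural_similarity(html_a, html_b):
    """Return a ratio in [0, 1] comparing the tag sequences of two HTML sources."""
    sequences = []
    for html in (html_a, html_b):
        collector = TagCollector()
        collector.feed(html)
        sequences.append(collector.tags)
    return SequenceMatcher(None, sequences[0], sequences[1]).ratio()


def visual_similarity(pixels_a, pixels_b):
    """Return 1 minus the normalised mean absolute pixel difference.

    Both arguments are hypothetical pre-rendered snapshots: lists of rows
    of grayscale values in the range 0-255, with identical dimensions.
    """
    flat_a = [p for row in pixels_a for p in row]
    flat_b = [p for row in pixels_b for p in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("snapshots must be non-empty and the same size")
    mean_diff = sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)
    return 1.0 - mean_diff / 255.0


def pages_similar(html_a, html_b, pixels_a, pixels_b, weight=0.5, threshold=0.9):
    """Combine both scores; weight and threshold are illustrative assumptions."""
    score = (weight * structural_similarity(html_a, html_b)
             + (1 - weight) * visual_similarity(pixels_a, pixels_b))
    return score >= threshold, score
```

A real system would use richer features on both sides (for example tree-based comparison of the page structure and image descriptors instead of raw pixel differences), and the threshold would have to be calibrated against pages that were judged manually.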


  1. May 14, 2012

    This is not an Issue, it's a Solution! What is the problem, challenge or issue that is being experienced with a specific Dataset? I would suggest moving the text in this page to a relevant Solution page.

    1. May 14, 2012

      Hi Paul, maybe it is a question of rephrasing, but measuring the similarity of two web pages still remains an issue. The solution is the one in SO18, where the actual software to do that is introduced. People at UPMC are now working on measuring success and on benchmarking.

      1. May 14, 2012

        Hi Dennis. Yes, I'm not suggesting that this is not useful work! But this page describes a solution, not a preservation issue. It would be useful to rephrase it to describe what the actual problem is and what the requirements for the solution are. Without these details, designing the solution correctly and evaluating it become very difficult.

  2. May 14, 2012

    I must agree with Paul. In the Issue description you should describe a "problem": why it is interesting to do this comparison at all. The description has a bit of this.

    The title could be, e.g., "IS28 Need for automatic change detection of web pages", based on the fact that web archives like Internet Memory currently inspect web pages manually, which by nature does not scale to millions of pages.