Pagelyzer


Summary

Purpose: Tool for comparing web pages based on a combined structural and visual approach. The research challenge for this tool is the frequency-based learning algorithm.
Homepage

Source Code Repository
https://github.com/openplanets/pagelyzer
License
As Is
Debian Package http://deb.openplanetsfoundation.org/pool/main/p/pagelyzer-ruby/

Description

Pagelyzer is a tool that compares two versions of a web page and decides whether or not they are similar.

It is based on:

  • a combination of structural and visual comparison methods embedded in a statistical discriminative model,
  • a visual similarity measure designed for Web pages that improves change detection,
  • a supervised feature selection method adapted to Web archiving.

We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
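The classification step above can be sketched in a few lines. This is a minimal stand-in written in pure Python (a perceptron learning a linear boundary rather than Pagelyzer's actual SVM), and the similarity-score vectors and labels below are made-up illustrations, not Pagelyzer data:

```python
# Toy stand-in for the SVM stage: a perceptron learns a linear boundary
# over vectors of similarity scores between two page versions.
# Feature values and labels are illustrative only.

def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Learn weights w and bias b such that sign(w.x + b) predicts the label."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    """1 = the two versions are similar, 0 = they changed."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Each vector holds similarity scores (e.g. visual, structural) computed
# between two versions of a page.
X = [[0.95, 0.90], [0.88, 0.93], [0.20, 0.15], [0.10, 0.30]]
y = [1, 1, 0, 0]
w, b = train_perceptron(X, y)
print(predict(w, b, [0.91, 0.89]))  # high-similarity pair → 1
print(predict(w, b, [0.15, 0.20]))  # low-similarity pair → 0
```

The real model is trained once on labelled version pairs and then applied to every pair encountered during crawling.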

The installation manual can be found here.

How does it work?

Step 1: For each URL given as input, Pagelyzer takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.
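The Decorated HTML idea can be illustrated with a small sketch: the geometry a browser computed for each element is written back into the markup as `data-*` attributes, so the rendered state survives without the browser. The element names, attribute names, and geometry values below are hypothetical, not Pagelyzer's actual format:

```python
# Illustrative sketch of "Decorated HTML": embed per-element rendering
# geometry (normally obtained from the browser) into the markup itself.
import xml.etree.ElementTree as ET

def decorate(html, geometry):
    """Attach data-x/y/width/height attributes to elements, keyed by id."""
    root = ET.fromstring(html)
    for el in root.iter():
        box = geometry.get(el.get("id"))
        if box:
            x, y, w, h = box
            el.set("data-x", str(x))
            el.set("data-y", str(y))
            el.set("data-width", str(w))
            el.set("data-height", str(h))
    return ET.tostring(root, encoding="unicode")

page = '<body><div id="header">News</div><div id="main">Story</div></body>'
boxes = {"header": (0, 0, 800, 60), "main": (0, 60, 800, 540)}
print(decorate(page, boxes))
```

Once the geometry lives in the document, later pipeline stages can work from the file alone, on any machine.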

Step 2: In this step each page is segmented based on its DOM tree information and visual rendering. In the previous version of the tool, called Marcalizer, VIPS [1] was used to segment web pages. We have since developed a new algorithm called Page-o-Metric [3], which removes VIPS's restriction to Internet Explorer as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned [2]. The details of this approach can be found in [3].
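The shape of the segmentation output can be sketched with a toy example: visual blocks (here, boxes nested by geometric containment) serialized as an XML tree, loosely in the spirit of Vi-XML. The real Page-o-Metric algorithm [3] uses DOM and rendering information; the block names and element names below are hypothetical:

```python
# Toy sketch: serialize a hierarchy of visual blocks as an XML tree.
import xml.etree.ElementTree as ET

def contains(outer, inner):
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def build_vi_xml(blocks):
    """blocks: dict name -> (x, y, w, h). Nest each block inside the last
    already-placed block that contains it; top-level blocks hang off <page>."""
    root = ET.Element("page")
    nodes = {}
    # Largest areas first, so parents are placed before their children.
    for name, box in sorted(blocks.items(), key=lambda kv: -(kv[1][2] * kv[1][3])):
        parent = root
        for other, node in nodes.items():
            if contains(blocks[other], box):
                parent = node
        nodes[name] = ET.SubElement(parent, "block", id=name,
                                    x=str(box[0]), y=str(box[1]),
                                    width=str(box[2]), height=str(box[3]))
    return ET.tostring(root, encoding="unicode")

blocks = {"content": (0, 0, 800, 600), "nav": (0, 0, 800, 80),
          "article": (0, 80, 800, 520)}
print(build_vi_xml(blocks))
```

Two such trees, one per page version, feed the descriptor extraction in the next step.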

Step 3: In this step, visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors [4], using a Bag of Words (BoW) representation for the images. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files [5]. The structural and visual differences are merged into a similarity vector as described in [6].
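The structural side of this step can be sketched as Jaccard indices over sets extracted from the two versions, concatenated with a visual score into the similarity vector that the SVM consumes. The choice of features (links, images) is illustrative; Pagelyzer's actual descriptors follow [5] and [6]:

```python
# Sketch: Jaccard-based structural scores merged with a visual score
# into a similarity vector. Feature choice is illustrative only.

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_vector(v1, v2, visual_score):
    """Merge per-feature structural scores with a precomputed visual score."""
    return [jaccard(v1["links"], v2["links"]),
            jaccard(v1["images"], v2["images"]),
            visual_score]

old = {"links": {"/home", "/news", "/about"}, "images": {"logo.png"}}
new = {"links": {"/home", "/news", "/contact"}, "images": {"logo.png"}}
print(similarity_vector(old, new, visual_score=0.87))  # [0.5, 1.0, 0.87]
```

Each component measures one kind of change, so the trained model can weight, say, link churn differently from image churn.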

References:

[1] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.

[2] Saad M.B., Gançarski S., Pehlivan Z. A Novel Web Archiving Approach Based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009

[3] Sanoja A., Gançarski S. Yet Another Web Page Segmentation Tool. In Proceedings of iPRES 2012, Toronto, Canada, 2012

[4] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004

[5] Pehlivan Z., Saad M.B., Gançarski S. Understanding Web Pages Changes. DEXA (1) 2010: 1-15

[6] M. Teva Law, C. Sureda, N. Thome, S. Gançarski, M. Cord. Structural and Visual Similarity Learning for Web Page Archiving, Workshop CBMI 2012

User Experiences

SO18 Comparing two web page versions for web archiving


Labels: characterisation, webarchive, qa, tool