Pagelyzer

Current version by Markus Plangg on Nov 21, 2013 19:34.

| Purpose | Tool for comparing web pages using a combined structural and visual approach. The research challenge for this tool is the frequency-based learning algorithm. |
| Homepage \\ | \\ |
| Source Code Repository \\ | [https://github.com/openplanets/pagelyzer] |
| License \\ | As Is \\ |
| Debian Package | [http://deb.openplanetsfoundation.org/pool/main/p/pagelyzer-ruby/] |
{color:#000000}Step 1: For each URL given as input, the tool takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.{color}

{color:#000000}Step 2: Each page is segmented based on its DOM tree information and its visual rendering. In the previous version of the tool, called Marcalizer, VIPS \[1\] was used to segment web pages. However, we developed a new algorithm called Page-o-Metric \[3\], which removes the VIPS restriction of using IE as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned \[2\]. The details of this approach can be found in \[3\].{color}

{color:#000000}Step 3: Visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors \[4\]. Images are represented using a Bag of Words (BoW) representation. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files \[5\]. The structural and visual differences are merged to obtain a similarity vector according to \[6\].{color}
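
To illustrate the Jaccard index mentioned in Step 3, here is a minimal Python sketch. It is not Pagelyzer's own code: the function name {{jaccard_index}}, the choice of link sets as the compared feature, and the sample URLs are all assumptions made for illustration. The page above does not say which element sets the tool actually compares.

```python
def jaccard_index(a, b):
    """Return |A intersect B| / |A union B| for two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical link sets extracted from two versions of the same page.
links_v1 = {"/home", "/about", "/news", "/contact"}
links_v2 = {"/home", "/about", "/blog", "/contact"}

# Three shared links out of five distinct links overall.
link_similarity = jaccard_index(links_v1, links_v2)
print(link_similarity)
```

A value of 1.0 means the two sets are identical and 0.0 means they share nothing. A score like this could serve as one component of the similarity vector, alongside the visual descriptors.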



h2. News Feeds