
Investigator(s)

Stanislav Barton

Dataset

Internet Memory Web Archive

Platform

Central instance at IMF

Purpose of the Experiment

From the tools developed within the scope of the project (in the preservation components sub-project), we selected the MarcAlizer tool, the first version of the Pagelyzer tool developed by UPMC, which performs a visual comparison between two web pages. MarcAlizer was then wrapped by Internet Memory so that it can be used within its infrastructure and the SCAPE platform. In a second phase, the renderability analysis should also include a structural comparison of the pages, which is implemented by the Pagelyzer tool.

Since the core renderability analysis is thus performed by an external tool, the overall performance of the wrapped tool is tied to this external dependency. We will keep integrating the latest releases from the MarcAlizer development, as well as updates to the tool resulting from more specific training.

Workflow

The detection of rendering issues is done in the following three steps:

  1. Screenshots of the web pages are taken automatically with the Selenium framework, for different browser versions.
  2. Pairs of screenshots are compared visually with the MarcAlizer tool (recently replaced by the Pagelyzer tool, which also includes a structural comparison).
  3. Rendering issues in the web pages are detected automatically, based on the comparison results.

The wrapper application orchestrates the main building blocks (Selenium instances and MarcAlizer comparators) and performs large-scale experiments on archived web content.

The browser versions currently tested are: Firefox (all available releases), Chrome (latest version only), Opera (official versions 11 and 12) and Internet Explorer (support still to be fixed).
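Step 1 above can be sketched with the Selenium Python bindings. This is a minimal illustration, not the project's actual code: the helper names are invented, only two of the browsers listed above are wired in, and the guarded import merely keeps the snippet loadable on machines without Selenium installed.

```python
# Minimal sketch of step 1: render one URL in several browsers with the
# Selenium Python bindings and save one screenshot per browser. Helper
# names are illustrative; the import guard only keeps the snippet
# loadable where Selenium is not installed.
try:
    from selenium import webdriver
except ImportError:  # Selenium not installed on this machine
    webdriver = None


def screenshot_name(url, browser):
    """Derive a flat PNG file name from the URL and browser name."""
    safe = url.replace("://", "_").replace("/", "_")
    return "%s_%s.png" % (safe, browser)


def take_screenshots(url, browsers=("firefox", "chrome")):
    """Render `url` in each requested browser and save one PNG each."""
    paths = []
    for name in browsers:
        # Map the browser name to its Selenium driver class.
        factories = {"firefox": webdriver.Firefox,
                     "chrome": webdriver.Chrome}
        driver = factories[name]()
        try:
            driver.get(url)
            path = screenshot_name(url, name)
            driver.save_screenshot(path)
            paths.append(path)
        finally:
            driver.quit()
    return paths
```

In the real wrapper, one such screenshot per browser version feeds the pairwise comparison of step 2.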

The initial implementation consists of several Python scripts, running on a Debian Squeeze (64-bit) platform. This version of the wrapped tool was released on GitHub, and we received valuable feedback from the sub-project partners:

https://github.com/crawler-IM/browser-shots-tool

The deployment and installation of the wrapped tool are rather easy, but strongly dependent on several other packages, since it uses "off-the-shelf" components that need to be available on your system, such as:

  • Python 2.6 or higher
  • Selenium 2.24.1
  • MarcAlizer 0.9

In order to make all the tools run together in a suitable environment, the following applications/packages need to be installed:

  1. Selenium drivers for the browsers: provided by Selenium in the Python client on its official website (for example, the Firefox driver is used in this project). Reference: http://pypi.python.org/pypi/selenium
  2. If no graphical user interface (GUI) is available on your system, you can use an X server (for example, we used Xvfb v11). The packages to be installed in this case are: xvfb, xfonts-base, xfonts-75dpi, xfonts-100dpi, libgl1-mesa-dri, xfonts-scalable, xfonts-cyrillic, gnome-icon-theme-symbolic
  3. Python: you can check the installed version with the command line: $ python --version
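Put together, the setup steps above amount to commands along the following lines (a sketch for a Debian system; the display number :99 and screen geometry are arbitrary choices, and the pinned Selenium version matches the dependency list above):

```shell
# Install Xvfb and the font/rendering packages listed above (Debian).
apt-get install xvfb xfonts-base xfonts-75dpi xfonts-100dpi \
    libgl1-mesa-dri xfonts-scalable xfonts-cyrillic gnome-icon-theme-symbolic

# Install the Selenium Python client (the Firefox driver ships with it).
pip install selenium==2.24.1

# Start a virtual framebuffer and point the browsers at it.
Xvfb :99 -screen 0 1024x768x24 &
export DISPLAY=:99

# Check the installed Python version.
python --version
```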

For the preliminary rounds of tests, we deployed the tool on three nodes of IM's cluster and performed automated comparisons for around 440 pairs of URLs. The average processing time was about 16 seconds per pair of web pages. These results showed that the existing solution is suitable for small-scale analysis only. Most of the processing time is actually spent on I/O operations and disk access to the binary screenshot files. Taking the screenshots proved to be very time consuming; if this solution is to be deployed at large scale, it needs to be further optimized and parallelized.

These results also showed that a serious performance bottleneck is the passing of intermediary parameters between the modules. More precisely, materializing the screenshots as binary files on disk is a very time-consuming operation, especially in large-scale experiments over a large number of web pages.
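The planned fix is to keep the screenshots in memory. `get_screenshot_as_png()` is the Selenium call that returns the rendered page as PNG bytes without writing a file; the length-prefixed framing and the comparator command below are illustrative assumptions, not MarcAlizer's actual stdin protocol.

```python
# Sketch of the streaming optimization: screenshots stay in memory and
# are piped to an external comparator over stdin, avoiding the disk
# round-trip identified as the bottleneck. The framing convention and
# comparator command line are hypothetical.
import struct
import subprocess


def grab_screenshot_bytes(driver):
    """Return the current page as PNG bytes (no file on disk)."""
    return driver.get_screenshot_as_png()


def stream_to_comparator(png_a, png_b, comparator_cmd):
    """Feed two in-memory screenshots to a comparator process via stdin.

    Each image is prefixed with its length as a 4-byte big-endian
    integer, so the consumer knows where one PNG ends and the next
    begins. This framing is an illustrative choice.
    """
    payload = (struct.pack(">I", len(png_a)) + png_a +
               struct.pack(">I", len(png_b)) + png_b)
    result = subprocess.run(comparator_cmd, input=payload,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

Swapping `comparator_cmd` for the real MarcAlizer invocation (once it accepts streamed input) removes the binary files from the hot path entirely.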

We therefore have to move to a different implementation of the tool, which will use an optimized version of MarcAlizer. The web page screenshots taken with Selenium will be passed directly to the MarcAlizer comparator using streams, and the new implementation of the browser-shots tool will be a MapReduce job running on a Hadoop cluster. Based on this framework, the current rounds of tests can be extended to a much higher number of pairs of URLs.
