Central instance at IMF
From the tools developed within the project (in the preservation components sub-project), we selected MarcAlizer, the first version of the Pagelyzer tool developed by UPMC, which performs a visual comparison between two web pages. MarcAlizer was then wrapped by Internet Memory so that it can be used within its infrastructure and the SCAPE platform. In a second phase, the renderability analysis should also include the structural comparison of the pages, which is implemented by the Pagelyzer tool.
Since the core renderability analysis is thus performed by an external tool, the overall performance of the wrapped tool is tied to this external dependency. We will keep integrating the latest MarcAlizer releases, as well as updates to the tool that result from more specific training.
For this experiment the tool is implemented as a MapReduce job to parallelize the processing of the input. The input in this latter case is a list of URLs together with a list of browser versions that are used to render the screenshots. Note the difference from the former version, where the input was pairs of URLs that were rendered using one common browser version and then compared.
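The per-URL work described above can be sketched as a map step. This is only an illustration in plain Python, not the actual Hadoop job: `render_stub` and `compare_stub` are hypothetical stand-ins for the browser screenshot and the MarcAlizer comparison, and comparing every pair of browsers is an assumption about how the pairs are formed.

```python
from itertools import combinations

def render_stub(url, browser):
    # Hypothetical stand-in for a real browser screenshot: returns fake image bytes.
    return f"{browser}:{url}".encode()

def compare_stub(shot_a, shot_b):
    # Hypothetical stand-in for the visual comparison: 1.0 if identical, else 0.0.
    return 1.0 if shot_a == shot_b else 0.0

def map_url(url, browsers, render=render_stub, compare=compare_stub):
    """Map step for one URL: render once per browser, then compare each browser pair."""
    shots = {b: render(url, b) for b in browsers}
    for a, b in combinations(browsers, 2):
        yield (url, a, b), compare(shots[a], shots[b])

# One input URL and two browsers give exactly one comparison record.
results = list(map_url("http://example.org/", ["firefox", "opera"]))
```

Each URL is rendered once per browser, and the screenshots are reused for all comparisons of that URL, so the number of expensive screenshot operations grows linearly with the browser list.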
To achieve acceptable running times, a newer version of the MarcAlizer comparison tool was integrated into this tool. The major improvement is the possibility of feeding the tool with in-memory objects instead of pointers to files on disk. This improvement, together with the elimination of unnecessary I/O operations, leads to the following average times for the individual steps of the screenshot comparison:
- screenshot acquisition - 2s
- MarcAlizer comparison - 2s
Note that the time needed to render a screenshot in a browser depends mainly on the size of the rendered page. For instance, capturing a wsj.com page takes about 15s on the IM machine, and the resulting JPEG image is as large as 10MB.
These figures show that the operations on the screenshots are very expensive (remember that the list of tested browsers can be very long, and each browser requires one screenshot operation). Therefore we need to parallelize the tool across several machines working on the input list of URLs. To facilitate this, we have employed Hadoop MapReduce, which is part of the SCAPE platform.
The results of the comparisons are then materialized as a set of XML files, where each file represents the comparison of one pair of browser screenshots. To alleviate the problem of having a large number of small files, these files are automatically bundled together into one ZIP file.
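The bundling step can be sketched with Python's standard `zipfile` module. This is a minimal illustration, not the tool's actual code: the file names and XML content below are hypothetical, and the real output layout may differ.

```python
import io
import zipfile

def bundle_results(xml_by_name):
    """Pack many small XML documents into a single in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, xml_text in sorted(xml_by_name.items()):
            # writestr stores the text directly, with no temporary file on disk.
            archive.writestr(name, xml_text)
    return buffer.getvalue()

# Hypothetical file names and XML content, for illustration only.
sample = {
    "example.org_firefox_opera.xml": "<comparison url='http://example.org/'/>",
    "example.com_firefox_opera.xml": "<comparison url='http://example.com/'/>",
}
zip_bytes = bundle_results(sample)

# Reading the archive back lists one entry per comparison file.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
```

Writing a single archive keeps the number of files on the distributed file system small, which is the motivation given above.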
At the moment, we have run preliminary tests on the currently supported browsers, Firefox and Opera. The list of URLs to test is about 13,000 entries long. We are using the IM central instance for these tests, which currently has two worker nodes (thus parallel execution can cut the processing time roughly in half).
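As a back-of-envelope check, the average step times quoted earlier can be combined with the size of this test. The sketch below is an estimate, not a measured result: it assumes 2s per screenshot and 2s per comparison for every page, one comparison per browser pair, perfect load balancing, and no Hadoop overhead. The function name `estimate_hours` is introduced here for illustration.

```python
from math import comb

def estimate_hours(n_urls, n_browsers, shot_s, compare_s, n_workers):
    """Rough wall-clock estimate, assuming perfect load balancing and no overhead."""
    # One screenshot per browser, one comparison per unordered browser pair.
    per_url_s = n_browsers * shot_s + comb(n_browsers, 2) * compare_s
    return n_urls * per_url_s / n_workers / 3600

single = estimate_hours(13_000, 2, 2, 2, 1)    # one worker node
parallel = estimate_hours(13_000, 2, 2, 2, 2)  # two worker nodes
print(f"{single:.1f} h on one node, {parallel:.1f} h on two")
```

Under these assumptions each URL costs 6s, so the full list takes roughly 21.7 hours on one node and 10.8 hours on two. Large pages such as the wsj.com example would push the real figure higher.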