Internet Memory Web Archive: Sample of 2.6 million URLs
The Internet Memory Foundation (IMF) takes the quality of archived web sites into account. Quality is assured by visual inspection: the live site on the Internet is compared with the archived copy on IMF servers.
To improve this process, IMF is developing an application that uses Pagelyzer, a tool developed by UPMC, to compare two images. The images are produced by a Selenium-based framework (version 2.24.1) that takes two snapshots: ideally, one from the archive access and one from the live web. The workflow consists of the following steps:
- Load the live page, take a screenshot (Selenium + headless Firefox)
- Load the web page from the archive, take a screenshot (Selenium + headless Firefox)
- Visual comparison of screenshots (Pagelyzer)
- Produce the output result file (score of comparison)
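The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiment: `archive_prefix`, the file names, and the tab-separated output format are hypothetical, and the Pagelyzer call is left as an injected `compare` function because its invocation is not described here. The screenshot helper uses the current Selenium Python API rather than the 2.24.1 setup.

```python
from typing import Callable, Iterable, NamedTuple


class ComparisonResult(NamedTuple):
    url: str
    score: float


def take_screenshot(url: str, out_path: str) -> str:
    """Load `url` in headless Firefox and save a PNG screenshot to `out_path`."""
    # Imported lazily so the rest of the sketch runs without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # current Selenium API, an assumption
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
    return out_path


def compare_url(
    url: str,
    archive_prefix: str,  # hypothetical: prefix of the archive access URL
    screenshot: Callable[[str, str], str],
    compare: Callable[[str, str], float],  # placeholder for the Pagelyzer call
) -> ComparisonResult:
    """Run steps 1 to 3 for a single URL and return its comparison score."""
    live_png = screenshot(url, "live.png")
    archive_png = screenshot(archive_prefix + url, "archive.png")
    return ComparisonResult(url, compare(live_png, archive_png))


def write_results(results: Iterable[ComparisonResult], path: str) -> None:
    """Step 4: write one 'url<TAB>score' line per comparison."""
    with open(path, "w", encoding="utf-8") as fh:
        for r in results:
            fh.write(f"{r.url}\t{r.score}\n")
```

Passing the screenshot and comparison steps in as functions keeps the workflow itself independent of the browser and of Pagelyzer, so either can be replaced or stubbed when testing.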
The difference from the previous multi-node experiment lies in the deployment of Selenium: previously it ran on a separate cluster, whereas now Selenium with headless Firefox runs on every processing machine.
The requirement is to process a large number of URLs (comparisons) in a reasonable time (days) on the available infrastructure (a mid-size cluster; see the description of the platform in the evaluation). Previous experiments showed that both the comparison with Pagelyzer and the rendering of a page are very time-consuming (2 s each), so we use these values as a goal. Closer analysis showed that page rendering consists of three components:
- Getting the source of a page (depends on the speed of the connection, saturation of link, speed of remote web server)
- Rendering the page (depends on the speed of the local machine running headless Firefox)
- Getting the snapshot (depends on the size of the page)
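Note that the first component (getting the source of a page) cannot be addressed by any optimization, since it depends on the network and the remote web server.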
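To see what the 2 s goals imply for the whole sample, a rough back-of-the-envelope estimate can be computed. The sketch below assumes that each URL needs two page renderings (live and archive) plus one comparison, run one after the other; the worker counts are hypothetical and do not describe the actual cluster.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(n_urls, render_s=2.0, compare_s=2.0, workers=1):
    """Estimate wall-clock days, assuming two renderings and one comparison per URL."""
    per_url_s = 2 * render_s + compare_s
    return n_urls * per_url_s / workers / SECONDS_PER_DAY


# Hypothetical worker counts, for illustration only.
for workers in (1, 10, 50):
    print(workers, round(estimated_days(2_600_000, workers=workers), 1))
```

Under these assumptions, 2.6 million URLs take about 180.6 days on a single sequential worker, 18.1 days on 10 parallel workers, and 3.6 days on 50, which is why the work has to be spread across the cluster to finish within days.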
The outcome of this experiment is the frequency distribution of the Pagelyzer scores, which helps assess the quality of the crawl.
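One way such a frequency table could be derived from a list of scores is sketched below. The bin width and the sample scores are hypothetical; the actual range of Pagelyzer scores is not assumed here.

```python
import math
from collections import Counter


def score_frequencies(scores, bin_width=0.1):
    """Count scores per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter(math.floor(s / bin_width) for s in scores)
    # Round the lower edges to hide floating-point noise in the keys.
    return {round(i * bin_width, 10): n for i, n in sorted(counts.items())}


# Hypothetical scores, for illustration only.
sample = [0.05, 0.07, 0.12, -0.33, 0.95]
for lower_edge, n in score_frequencies(sample).items():
    print(f"{lower_edge:+.1f}\t{n}")
```

Scores that fall exactly on a bin edge may land in the neighbouring bin because of floating-point division, which is usually acceptable for a coarse quality overview.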