Stanislav Barton, Internet Memory

Evaluation specs platform/system level

Field                   Data type   Value
Evaluation seq. num.    int         1
Evaluator-ID            email
Evaluation description  text        IMF takes the quality of archived websites into account. Quality is assured by visual inspection: the site on the live Internet is compared with the archived site on IMF servers.
To improve that process, IMF is developing an application that uses Pagelyzer, developed by UPMC, to compare two images. The two images are produced by a Selenium-based framework (v2.24.1), which takes two screenshots: ideally, one from the archive access and one from the live web.

1° Load the live page and take a screenshot (Selenium + headless Firefox)
2° Load the web page from the archive and take a screenshot (Selenium + headless Firefox)
3° Compare the two screenshots visually (Pagelyzer)
4° Produce the output result file (comparison score)
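
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the application's actual code (which is written in Java/Ruby): `capture_with_selenium`, `compare_page`, `write_result`, the file names, and the tab-separated output format are assumptions made for the example. The Pagelyzer call is left as a caller-supplied `score` function because its interface is not described here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ComparisonResult:
    url: str
    archive_url: str
    score: float


def capture_with_selenium(url: str, out_path: str) -> str:
    """Load `url` in Firefox and save a PNG screenshot to `out_path`.

    Assumes the `selenium` package, Firefox, and a display (for example
    Xvfb) are available. The import is local so that the rest of the
    sketch can be used without Selenium installed.
    """
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        driver.save_screenshot(out_path)
    finally:
        driver.quit()  # always release the browser, even on failure
    return out_path


def compare_page(
    live_url: str,
    archive_url: str,
    capture: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> ComparisonResult:
    live_png = capture(live_url, "live.png")           # step 1
    archive_png = capture(archive_url, "archive.png")  # step 2
    similarity = score(live_png, archive_png)          # step 3 (hypothetical Pagelyzer wrapper)
    return ComparisonResult(live_url, archive_url, similarity)


def write_result(result: ComparisonResult) -> str:
    # Step 4: one tab-separated line per comparison (assumed format).
    return f"{result.url}\t{result.archive_url}\t{result.score:.3f}"
```

Passing `capture` and `score` as parameters keeps the browser and the comparison tool replaceable, which also makes the flow easy to exercise with stand-in functions.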

Goal / Sub-goal:
          Performance efficiency / Throughput
  • Loading web pages can take time; the duration depends on factors such as the complexity of the page, the Internet connection, the browser and browser version used, and the status of remote servers.
  • Taking the screenshot using Selenium
  • Comparing with Pagelyzer
  • Overhead (preparation of the next comparison)
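
One way to see where the time goes is to time each phase separately. The sketch below is a hypothetical helper: the class `PhaseTimer` and the phase name `live_get` are assumptions for illustration, not part of the evaluated application. It accumulates wall-clock time per phase so that per-phase averages can be reported.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates elapsed time and call counts per named phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the phase raised (e.g. a page timeout).
            self.totals[name] += self._clock() - start
            self.counts[name] += 1

    def average(self, name):
        # Raises ZeroDivisionError if the phase was never recorded.
        return self.totals[name] / self.counts[name]


timer = PhaseTimer()
with timer.phase("live_get"):
    pass  # loading the live page would go here
```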

    Reliability / Stability Indicators
    The external tools needed are:
  • Selenium with Firefox (for this evaluation)
  • Xvfb (a virtual display server, needed to run Firefox on a virtual screen)
  • Pagelyzer
    The application is developed in Java/Ruby.
    All required components are installed separately (as package dependencies).

    Reliability / Runtime stability
  • The result is a floating-point score that quantifies the differences between the two images.
Evaluation-Date         DD/MM/YYYY  01/09/2014
Platform-ID             string
Dataset(s)              string      Sample of 2.6 million URLs from the IMF Web Archive
Workflow method         string      MapReduce job using Selenium and Pagelyzer internally
Workflow(s) involved    URL(s)
Tool(s) involved        URL(s)
Link(s) to Scenario(s)  URL(s)
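
The map side of such a MapReduce workflow could look roughly like the sketch below, written in the style of a streaming mapper that reads one URL per line and emits one record per URL. The function name `map_urls`, the `compare` callable (standing in for the Selenium and Pagelyzer steps), and the record format are assumptions for illustration; the actual job is not shown in this document.

```python
from typing import Callable, Iterable, Iterator


def map_urls(lines: Iterable[str], compare: Callable[[str], float]) -> Iterator[str]:
    """Emit 'url<TAB>score' for every non-empty input line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the input split
        try:
            yield f"{url}\t{compare(url):.3f}"
        except Exception as exc:
            # One failing page (timeout, dead server) should not abort the
            # whole task, so the error type is emitted instead of a score.
            yield f"{url}\tERROR:{type(exc).__name__}"
```

In a streaming setup the mapper would be driven by iterating over standard input and printing each yielded record.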



Platform IMF 2

Field                            Data type  Value
Platform-ID                      string     IMF Cluster 2
Platform description             string     Cloudera CDH4.6, 43 nodes
Number of nodes                  integer    43
Total number of physical CPUs    integer    43
CPU specs                        string     15 * dual-core AMD G-T56N @ 1600 MHz,
                                            28 * Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz
Total number of CPU-cores        integer    142 cores (15 * 2 cores + 28 * 4 cores)
Total amount of RAM in Gbytes    integer    568 GB (15 * 8 GB + 28 * 16 GB)
average CPU-cores for nodes      integer    3.3
average RAM in Gbytes for nodes  integer    13.2
Operating System on nodes        string     Debian 6 squeeze (64-bit)
Storage system/layer             string     HDFS
Network layer between nodes      string     Local copy between two nodes: 80 MB/s (640 Mbps)
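
The totals and averages in the platform table follow directly from the two node groups. The short check below recomputes them; the dictionary layout and variable names are only illustrative.

```python
# Node groups as listed under "CPU specs".
node_groups = [
    {"count": 15, "cores": 2, "ram_gb": 8},   # AMD G-T56N, dual core
    {"count": 28, "cores": 4, "ram_gb": 16},  # Intel Core i5-3470S
]

nodes = sum(g["count"] for g in node_groups)                        # 43
total_cores = sum(g["count"] * g["cores"] for g in node_groups)     # 142
total_ram_gb = sum(g["count"] * g["ram_gb"] for g in node_groups)   # 568

avg_cores_per_node = round(total_cores / nodes, 1)    # 3.3
avg_ram_gb_per_node = round(total_ram_gb / nodes, 1)  # 13.2

# 80 megabytes per second expressed in megabits per second.
link_mbps = 80 * 8  # 640
```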

Evaluation points

Metric                      Baseline definition                                          Baseline value  Goal       Evaluation 1 (01/09/2014)
NumberOfObjectsPerSecond    Number of comparisons made per hour                          0               3          4
ScoresAchieved              Frequency of similarity scores assessed by Pagelyzer         0               0          0
TotalNumberOfURLsProcessed  Total number of URLs used for comparison                     0               2,600,000  2,600,000
AverageGetTimeFromArchive   Average time spent getting page from web archive in seconds  0               2          1.7
AverageGetTimeFromLive      Average time spent getting page from live web                0               2          2
AveragePagelyzerTime        Average time spent comparing snapshots                       0               2          1.7
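
ScoresAchieved is defined as a frequency of similarity scores. A minimal sketch of how such a frequency table could be built is shown below. The function name `score_frequencies`, the bucketing by rounding to one decimal, and the sample scores are assumptions for illustration; the sample values are made up and are not results of this evaluation.

```python
from collections import Counter
from typing import Dict, Iterable


def score_frequencies(scores: Iterable[float], decimals: int = 1) -> Dict[float, int]:
    """Count how many scores fall into each rounded bucket."""
    buckets = Counter(round(s, decimals) for s in scores)
    return dict(sorted(buckets.items()))


sample_scores = [0.91, 0.88, 0.12, 0.93]  # made-up values, not measured data
frequencies = score_frequencies(sample_scores)
```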

