
h2. Evaluator(s)

Stanislav Barton, Internet Memory

h2. *Evaluation specs platform/system level*

| *Field* | *Data type* | *Value* |
| *Evaluation seq. num.* | int | 1 |
| *Evaluator-ID* | email | [email protected] |
| *Evaluation description* | text | IMF monitors the quality of archived web sites. Quality is currently assured by visual inspection: the live site on the Internet is compared with the archived copy on the IMF servers. \\
To improve that process, IMF is developing an application that uses Pagelyzer, developed by UPMC, to compare two images. The images are produced by a Selenium-based framework (v2.24.1) that takes two snapshots: ideally, one from the archive access and one from the live web. \\
\\
Workflow: \\
1. Load the live page and take a screenshot (Selenium + headless Firefox) \\
2. Load the same page from the archive and take a screenshot (Selenium + headless Firefox) \\
3. Compare the two screenshots visually (Pagelyzer) \\
4. Write the output result file (comparison score) \\
\\
*Goal / Sub-goal:* \\
*Performance efficiency / Throughput*
* Page load time varies with factors such as the complexity of the page, the Internet connection, the browser and its version, and the status of remote servers.
* Taking the screenshot using Selenium
* Comparing the screenshots with Pagelyzer
* Overhead (preparation of the next comparison) \\
\\
*Reliability / Stability Indicators* \\
The external tools needed are:
* Selenium with Firefox (for this evaluation)
* Xvfb (an X display server, needed to run Firefox on a virtual screen)
* Pagelyzer \\
The application is developed in Java/Ruby \\
All required components are installed separately (as package dependencies) \\
\\
*Reliability / Runtime stability*
* The result is a floating-point score that quantifies the differences between two images |
| *Evaluation-Date* | DD/MM/YY | 01/09/2014 |
| *Platform-ID* | string | |
| *Dataset(s)* | string | Sample of 2.6 million URLs from [IMF Web Archive|SP:Internet Memory Web Archive]\\ |
| *Workflow method* | string | MapReduce job using Selenium and Pagelyzer internally \\ |
| *Workflow(s) involved* | URL(s) | \\ |
| *Tool(s) involved* | URL(s) | \\ |
| *Link(s) to Scenario(s)* | URL(s) | |
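
The four workflow steps in the evaluation description can be sketched as a single per-URL function. This is only an illustration under stated assumptions, not the IMF implementation (which is written in Java/Ruby): `compare_url`, `capture`, `compare` and `archive_prefix` are hypothetical names, the archive address is assumed to be a prefix placed in front of the live URL, and the Selenium and Pagelyzer calls are passed in as plain callables so that no tool API is guessed.

```python
from typing import Callable


def compare_url(
    url: str,
    archive_prefix: str,
    capture: Callable[[str], bytes],
    compare: Callable[[bytes, bytes], float],
) -> str:
    """Run the four workflow steps for one URL and return one result line.

    capture: hypothetical callable that loads a page and returns a screenshot
             (in the evaluation this role is played by Selenium + Firefox).
    compare: hypothetical callable that returns a similarity score
             (in the evaluation this role is played by Pagelyzer).
    """
    live_shot = capture(url)                        # step 1: live page
    archived_shot = capture(archive_prefix + url)   # step 2: archived page (assumed URL scheme)
    score = compare(live_shot, archived_shot)       # step 3: visual comparison
    return f"{url}\t{score:.3f}"                    # step 4: one line of the result file
```

In a MapReduce setting, a function of this shape would typically be called once per input URL inside the map phase, with the returned line emitted as the map output.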

h2. *Platform IMF 2*

| *Field* | *Data type* | *Value* |
| *Platform-ID* | String | IMF Cluster 2 |
| *Platform description* | String | Cloudera CDH4.6 \\
43 nodes |
| *Number of nodes* | integer | 43 |
| *Total number of physical CPUs* | integer | 43 |
| *CPU specs* | string | 15 * dual-core AMD G-T56N @ 1600 MHz, \\
28 * Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz |
| *Total number of CPU-cores* | integer | 142 cores (15 * 2 cores + 28 * 4 cores) |
| *Total amount of RAM in Gbytes* | integer | 568 GB (15 * 8 GB + 28 * 16 GB) |
| *average CPU-cores for nodes* | integer | 3.3 |
| *average RAM in Gbytes for nodes* | integer | 13.2 |
| *Operating System on nodes* | String | Debian 6 squeeze (64bit) |
| *Storage system/layer* | String | HDFS |
| *Network layer between nodes* | String | Local copy between two nodes: 80 MB/s (640 Mbps) |
\\
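
The totals and averages in the platform table follow directly from the two node groups. The short check below recomputes them; the variable names are illustrative, and the per-node figures are taken from the table itself.

```python
# Node groups as listed in the platform table:
# (number of nodes, cores per node, RAM in GB per node)
node_groups = [
    (15, 2, 8),    # AMD G-T56N nodes
    (28, 4, 16),   # Intel Core i5-3470S nodes
]

nodes = sum(n for n, _, _ in node_groups)
total_cores = sum(n * cores for n, cores, _ in node_groups)
total_ram_gb = sum(n * ram for n, _, ram in node_groups)

avg_cores = round(total_cores / nodes, 1)
avg_ram_gb = round(total_ram_gb / nodes, 1)

# Network figure: megabytes per second converted to megabits per second.
link_mbps = 80 * 8

print(nodes, total_cores, total_ram_gb, avg_cores, avg_ram_gb, link_mbps)
# prints: 43 142 568 3.3 13.2 640
```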

h2. *Evaluation points*

| *Metric* | *Baseline definition* | *Baseline value* | *Goal* | *Evaluation 1 (01/09/2014)* |
| *NumberOfObjectsPerSecond* | Number of comparisons made per second | 0 | 3 | 4 |
| *ScoresAchieved* | Frequency of similarity scores assessed by Pagelyzer | 0 | 0 | 0 |
| *TotalNumberOfURLsProcessed* | Total number of URLs used for comparison | 0 | 2,600,000 | 2,600,000 |
| *AverageGetTimeFromArchive* | Average time spent getting page from web archive in seconds | 0 | 2 | 1.7 |
| *AverageGetTimeFromLive* | Average time spent getting page from live web in seconds | 0 | 2 | 2 |
| *AveragePagelyzerTime* | Average time spent comparing snapshots in seconds | 0 | 2 | 1.7 |
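
As a rough cross-check of the figures in the evaluation points table, the arithmetic below relates the throughput to the per-step averages. It rests on two assumptions that the table does not state: that the throughput value of 4 is comparisons per second (as the metric name suggests), and that the three timed steps run one after another for each URL. The concurrency estimate uses Little's law (items in flight = throughput * time per item) and is an estimate, not a measured value.

```python
total_urls = 2_600_000          # TotalNumberOfURLsProcessed
rate_per_second = 4             # NumberOfObjectsPerSecond (assumed unit: per second)
step_seconds = [1.7, 2.0, 1.7]  # archive fetch, live fetch, Pagelyzer comparison

# Wall-clock time needed to process the whole sample at the measured rate.
wall_clock_days = total_urls / rate_per_second / 86_400

# Time one comparison takes if the three steps run sequentially (assumption).
seconds_per_comparison = sum(step_seconds)

# Little's law: average number of comparisons that must be in flight at once.
concurrent_comparisons = rate_per_second * seconds_per_comparison

print(f"{wall_clock_days:.1f} days, "
      f"{seconds_per_comparison:.1f} s per comparison, "
      f"about {concurrent_comparisons:.1f} comparisons in flight")
```

Under these assumptions the sample takes about a week of cluster time, and roughly twenty comparisons run in parallel on average, which is well below the 142 available cores.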