Evaluator(s)
Stanislav Barton, Internet Memory
Evaluation specs platform/system level
Field | Data type | Value |
Evaluation seq. num. | int | 1 |
Evaluator-ID | [email protected] | |
Evaluation description | text | IMF checks the quality of archived web sites by visual inspection: the site on the live web is compared with the archived copy on IMF servers. To improve that process, IMF is developing an application that uses Pagelyzer, developed by UPMC, to compare two images. The two images are snapshots produced by a Selenium-based framework (v2.24.1): ideally, one is taken from the archive access and the other from the live web. Workflow: 1° Load the live page and take a screenshot (Selenium + headless Firefox). 2° Load the page from the archive and take a screenshot (Selenium + headless Firefox). 3° Compare the two screenshots visually (Pagelyzer). 4° Produce the output result file (comparison score). Goal / Sub-goal: Performance efficiency / Throughput |
Evaluation-Date | DD/MM/YY | 01/09/2014 |
Platform-ID | string | |
Dataset(s) | string | Sample of 2.6 million URLs from the IMF Web Archive |
Workflow method | string | MapReduce job using Selenium and Pagelyzer internally |
Workflow(s) involved | URL(s) | |
Tool(s) involved | URL(s) | |
Link(s) to Scenario(s) | URL(s) | |
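The four workflow steps in the evaluation description can be sketched as follows. This is a minimal illustration, not the IMF implementation: `take_screenshot` stands in for the Selenium + headless Firefox snapshot and `compare_images` for the Pagelyzer comparison, and both are hypothetical callables supplied by the caller. Building the archive URL as `archive_prefix + url` is also an assumption.

```python
import time
from typing import Callable, NamedTuple


class ComparisonResult(NamedTuple):
    url: str
    score: float
    live_seconds: float
    archive_seconds: float
    compare_seconds: float


def timed(fn: Callable, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def compare_url(url, archive_prefix, take_screenshot, compare_images):
    """Steps 1-3: snapshot live page, snapshot archived page, compare them.

    take_screenshot(url) -> bytes and compare_images(a, b) -> float are
    hypothetical stand-ins for the Selenium and Pagelyzer calls.
    """
    live_png, t_live = timed(take_screenshot, url)                       # step 1
    archive_png, t_arch = timed(take_screenshot, archive_prefix + url)   # step 2
    score, t_cmp = timed(compare_images, live_png, archive_png)          # step 3
    return ComparisonResult(url, score, t_live, t_arch, t_cmp)


def to_output_line(result: ComparisonResult) -> str:
    """Step 4: one tab-separated line of the result file (assumed layout)."""
    return f"{result.url}\t{result.score:.3f}"
```

Recording the three elapsed times per URL is what would feed the AverageGetTimeFromLive, AverageGetTimeFromArchive and AveragePagelyzerTime metrics; in a MapReduce job, `compare_url` would be the body of the map step.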
Platform IMF 2
Field | Data type | Value |
Platform-ID | String | IMF Cluster 2 |
Platform description | String | Cloudera CDH4.6 43 nodes |
Number of nodes | integer | 43 |
Total number of physical CPUs | integer | 43 |
CPU specs | string | 15 * dual-core AMD G-T56N @ 1600MHz, 28 * Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz |
Total number of CPU-cores | integer | 142 Cores (15 * 2 Cores + 28 * 4 Cores) |
Total amount of RAM in Gbytes | integer | 568GB (15 * 8GB + 28 * 16GB) |
average CPU-cores for nodes | integer | 3.3 |
average RAM in Gbytes for nodes | integer | 13.2 |
Operating System on nodes | String | Debian 6 squeeze (64bit) |
Storage system/layer | String | HDFS |
Network layer between nodes | String | Local copy between two nodes: 80 MB/s (640 Mbps) |
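The totals and averages in the platform table follow from the two node groups. The short check below recomputes them; the dictionary layout and variable names are only illustrative.

```python
# Two node groups as listed in the platform table.
node_groups = [
    {"nodes": 15, "cores_per_node": 2, "ram_gb_per_node": 8},   # AMD G-T56N
    {"nodes": 28, "cores_per_node": 4, "ram_gb_per_node": 16},  # Intel i5-3470S
]

total_nodes = sum(g["nodes"] for g in node_groups)                          # 43
total_cores = sum(g["nodes"] * g["cores_per_node"] for g in node_groups)    # 142
total_ram_gb = sum(g["nodes"] * g["ram_gb_per_node"] for g in node_groups)  # 568

avg_cores_per_node = round(total_cores / total_nodes, 1)    # 3.3
avg_ram_gb_per_node = round(total_ram_gb / total_nodes, 1)  # 13.2

# Network figure: 80 megabytes per second expressed in megabits per second.
link_mbps = 80 * 8  # 640
```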
Evaluation points
Metric | Baseline definition | Baseline value | Goal | Evaluation 1 (01/09/2014) |
NumberOfObjectsPerSecond | Number of comparisons made per hour | 0 | 3 | 4 |
ScoresAchieved | Frequency of similarity scores assessed by Pagelyzer | 0 | 0 | 0 |
TotalNumberOfURLsProcessed | Total number of URLs used for comparison | 0 | 2,600,000 | 2,600,000 |
AverageGetTimeFromArchive | Average time spent getting page from web archive in seconds | 0 | 2 | 1.7 |
AverageGetTimeFromLive | Average time spent getting page from live web in seconds | 0 | 2 | 2 |
AveragePagelyzerTime | Average time spent comparing snapshots in seconds | 0 | 2 | 1.7 |
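As a rough reading of the three average times above, the sketch below adds them up per URL. It assumes the three stages run one after another for each URL and that nothing else (browser start-up, I/O, job scheduling) adds time; neither assumption is stated in the evaluation, so the figures are an illustration only.

```python
# Average stage times in seconds, taken from the Evaluation 1 column.
avg_get_live_s = 2.0
avg_get_archive_s = 1.7
avg_pagelyzer_s = 1.7

# Assumed: the stages run sequentially for each URL, with no other overhead.
per_url_s = round(avg_get_live_s + avg_get_archive_s + avg_pagelyzer_s, 1)  # 5.4

# Time a single sequential worker would need for the whole sample.
urls = 2_600_000
single_worker_days = round(urls * per_url_s / 86_400, 1)  # 162.5
```

Under these assumptions a single worker would need about 162.5 days for the full sample, which shows why the comparison is run as a parallel job on the cluster.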