Collection:
Title |
Austrian National Library - Web Archive |
Description | The Austrian National Library provides representative datasets from its web archive: event-based selective crawls (sites harvested frequently during an event, e.g. the EU election 2009 or the 2010 Olympic Games) and the 2009 domain crawl covering about 1 million domains. The web archive data is available in the ARC.GZ format. The size of the ARC.GZ data set is 1377 GB. The metadata log file produced during the crawl process is available as a txt file and has a size of 197 GB. |
Licensing | Sample only available to SCAPE partners. |
Owner | Austrian National Library (ONB) |
Collection expert | Prändl-Zika Veronika
Issues brainstorm | |
List of Issues | IS25 Web Content Characterisation; IS41 Analyse huge text files containing information about a web archive
Issue:
Title |
IS41 Analyse huge text files containing information about a web archive |
Detailed description | Some web archives produce information about their content on a periodical basis. The results are sometimes stored as huge text files. Processing these text files is very time consuming, and it is difficult to extract statistics from them or to analyse them. |
Scalability Challenge |
Depending on the size of the text files and the type of processing, a scalability challenge may exist.
Issue champion | Markus Raditsch
Other interested parties |
|
Possible Solution approaches | Hadoop is designed for analysing huge text files; using the MapReduce programming model should therefore be considered (a minimal sketch is given after this table).
Context | The web archive at the Austrian National Library produces huge text files as the result of the web harvesting process. |
Lessons Learned | |
Training Needs | Hadoop training. |
Datasets | Austrian National Library - Web Archive |
Solutions | SO27 Analyse huge text files containing information about a web archive using Hadoop |
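As a rough illustration of the MapReduce approach mentioned under "Possible Solution approaches", the sketch below sums object sizes per crawl ID and MIME type, so that a figure such as the total size of all "image/gif" objects for a given crawlID can be read directly from the job output. It is a minimal sketch only: the tab-separated line layout it assumes (crawlID, URL, MIME type, size in bytes) and the class names are illustrative, not the actual ONB crawl-log format or the SO27 implementation.

```java
// Minimal sketch, not the actual SO27 implementation. The log line layout
// (tab-separated: crawlID, URL, MIME type, size in bytes) is an assumption
// made for illustration only.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrawlLogStats {

    /** Emits ("<crawlID>\t<mimeType>", sizeInBytes) for every parseable log line. */
    public static class SizeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text outKey = new Text();
        private final LongWritable outValue = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 4) {
                return; // skip malformed lines
            }
            try {
                outValue.set(Long.parseLong(fields[3])); // object size in bytes
            } catch (NumberFormatException e) {
                return; // size column not numeric, ignore the line
            }
            outKey.set(fields[0] + "\t" + fields[2]);    // crawlID + MIME type
            context.write(outKey, outValue);
        }
    }

    /** Sums the byte counts per (crawlID, MIME type) key. */
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }
}
```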
Evaluation
Objectives | Automation, scalability |
Success criteria | The technical implementation will make it possible to analyse statistical data stored in large text files and provide an appropriate output (e.g. the total size of all "image/gif" objects related to a certain crawlID). The input text files will be provided by a web crawler system in a specific format. |
Automatic measures | In a comparable "traditional setup" (a single quad-core server processing the files sequentially) we measured a throughput of around 26,500 lines/second. Goal:
|
Manual assessment | Processing time can be influenced by adding worker nodes to, or removing them from, the cluster (see the driver sketch after this table).
Actual evaluations | links to actual evaluations of this Issue/Scenario
Evaluation URL |
EVAL-WCT8-1
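The driver sketch below assumes the hypothetical CrawlLogStats mapper and reducer classes from the sketch above and shows how such a job would typically be configured and submitted. Nothing in it is specific to cluster size, which is why processing time can be tuned simply by adding or removing worker nodes, as noted under "Manual assessment". Input and output paths are illustrative.

```java
// Job driver sketch for the CrawlLogStats classes shown above. The job code
// is independent of cluster size: throughput is scaled by adding or removing
// worker nodes, not by changing this code.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlLogStatsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawl-log-stats");
        job.setJarByClass(CrawlLogStatsDriver.class);
        job.setMapperClass(CrawlLogStats.SizeMapper.class);
        // Summation is associative, so the reducer can also run as a combiner,
        // which cuts the data shuffled between map and reduce tasks.
        job.setCombinerClass(CrawlLogStats.SumReducer.class);
        job.setReducerClass(CrawlLogStats.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. HDFS directory with crawl-log text files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```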
Solutions:
Title | SO27 Analyse huge text files containing information about a web archive using Hadoop |
Detailed description | Analyse huge text files containing information about a web archive using Hadoop |
Solution Champion |
Markus Raditsch
Corresponding Issue(s) |
IS41 Analyse huge text files containing information about a web archive |
myExperiment Link |
|
Tool Registry Link |
|
Evaluation |
|
Labels:
1 Comment
Oct 11, 2012
Miguel Ferreira
Is this a preservation action?!