Title | IS41 Analyse huge text files containing information about a web archive
Detailed description | Some web archives produce information about their content on a periodic basis. The result is often stored as huge text files. Processing these text files is very time-consuming, and it is difficult to extract statistics from them or analyse them. |
Scalability Challenge | Depending on the size of the text files and the type of processing, a scalability challenge may exist.
Issue champion | Markus Raditsch |
Other interested parties | |
Possible Solution approaches | Hadoop is designed for analysing huge text files; using the MapReduce programming model should therefore be considered (a MapReduce sketch follows this table). |
Context | The web archive at the Austrian National Library produces huge text files as the result of the web harvesting process. |
Lessons Learned | |
Training Needs | Hadoop training. |
Datasets | Austrian National Library - Web Archive |
Solutions | SO27 Analyse huge text files containing information about a web archive using Hadoop |
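
The MapReduce approach named above could take roughly the following shape. This is a minimal sketch, not the SO27 implementation: it assumes a hypothetical tab-separated log format whose first three columns are crawlID, MIME type and object size in bytes; the class names, column positions and delimiter are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sums object sizes per (crawlID, MIME type) over huge crawl log text files. */
public class CrawlLogSizeAnalysis {

    /** Emits (crawlID <TAB> MIME type, object size) for every parseable log line. */
    public static class SizeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text outKey = new Text();
        private final LongWritable outValue = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed column order: crawlID, MIME type, size in bytes (tab-separated).
            String[] fields = line.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            try {
                long size = Long.parseLong(fields[2].trim());
                outKey.set(fields[0] + "\t" + fields[1]);
                outValue.set(size);
                context.write(outKey, outValue);
            } catch (NumberFormatException e) {
                // size column is not numeric -> ignore the line
            }
        }
    }

    /** Adds up all sizes sharing the same (crawlID, MIME type) key. */
    public static class SizeReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> sizes, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable size : sizes) {
                total += size.get();
            }
            context.write(key, new LongWritable(total));
        }
    }
}
```

Because the job keys on crawlID plus MIME type, a single run produces the totals for every combination, from which a figure such as the total size of all "image/gif" objects of a given crawlID (see the success criteria below) can be read off directly.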
Evaluation
Objectives | Automation, scalability |
Success criteria | The technical implementation will make it possible to analyse statistical data stored in large text files and provide an appropriate output (e.g. the total size of all "image/gif" objects related to a certain crawlID). The input text files will be provided by a web crawler system in a specific format. |
Automatic measures | In a comparable "traditional" setup (a single quad-core server processing sequentially) we measured a throughput of around 26,500 lines/second. Goal:
|
Manual assessment | Processing time can be influenced by adding worker nodes to or removing them from the cluster (a driver sketch for submitting such a job to the cluster follows this table). |
Actual evaluations | Links to actual evaluations of this Issue/Scenario |
Evaluation URL | EVAL-WCT8-1 |
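
For completeness, a driver sketch for submitting the job from the first sketch to a Hadoop cluster; again an assumption-laden illustration rather than the actual SO27 code. The summing reducer is reused as a combiner, which cuts shuffle traffic and helps throughput scale with the number of worker nodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Configures and submits the crawl log analysis job (hypothetical driver). */
public class CrawlLogSizeAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawl log size per crawlID and MIME type");
        job.setJarByClass(CrawlLogSizeAnalysisDriver.class);

        job.setMapperClass(CrawlLogSizeAnalysis.SizeMapper.class);
        job.setCombinerClass(CrawlLogSizeAnalysis.SizeReducer.class); // summing is associative
        job.setReducerClass(CrawlLogSizeAnalysis.SizeReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory holding the crawl log text files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a driver is run with the standard `hadoop jar` command; since the input splits are processed in parallel, adding or removing worker nodes changes the processing time, which is what the manual assessment above measures.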