IS41 Analyse huge text files containing information about a web archive
Detailed description Some web archives periodically produce information about their content. The result is sometimes stored as huge text files. Processing these text files is very time-consuming, and it is difficult to extract statistics from them or to analyse them.
Scalability Challenge
Depending on the size of the text files and the type of processing, a scalability challenge might exist.
Issue champion Markus Raditsch (ONB)
Other interested parties
Possible Solution approaches Hadoop is designed for analysing huge text files, so the MapReduce programming model should be considered.
Context The web archive at the Austrian National Library produces huge text files as the result of the web harvesting process.
Lessons Learned
Training Needs Hadoop training.
Datasets Austrian National Library - Web Archive
Solutions SO27 Analyse huge text files containing information about a web archive using Hadoop


Objectives Automation, scalability
Success criteria The technical implementation will make it possible to analyse statistical data stored in large text files and provide an appropriate output (e.g. total size of all "image/gif" objects related to a certain crawlID). The input text files will be provided by a web crawler system in a specific format.
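The example output named above (total size of all "image/gif" objects for one crawlID) could be computed with a map step and a reduce step roughly as sketched below. This is only an illustration under stated assumptions: the crawler's actual line format is not described on this page, so the sketch assumes a hypothetical layout of three whitespace-separated fields (crawl ID, MIME type, size in bytes). The function names `map_line`, `reduce_sizes` and `total_size` are invented for the example, and everything runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# ASSUMPTION: every input line looks like "<crawlID> <mimeType> <sizeInBytes>".
# The real crawler output format is not specified here and will differ.
Key = Tuple[str, str]  # (crawl ID, MIME type)


def map_line(line: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((crawl_id, mime_type), size) for one well-formed line."""
    parts = line.split()
    if len(parts) != 3:
        return  # skip lines that do not match the assumed layout
    crawl_id, mime_type, raw_size = parts
    try:
        size = int(raw_size)
    except ValueError:
        return  # skip lines whose size field is not an integer
    yield (crawl_id, mime_type), size


def reduce_sizes(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Reduce step: sum the sizes emitted for each (crawl_id, mime_type) key."""
    totals: Dict[Key, int] = defaultdict(int)
    for key, size in pairs:
        totals[key] += size
    return dict(totals)


def total_size(lines: Iterable[str], crawl_id: str, mime_type: str) -> int:
    """Total size in bytes of all objects of one MIME type in one crawl."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_sizes(pairs).get((crawl_id, mime_type), 0)


# Made-up sample lines, for illustration only.
sample = [
    "crawl-1 image/gif 1200",
    "crawl-1 image/gif 800",
    "crawl-1 text/html 5000",
    "crawl-2 image/gif 300",
    "malformed line",
]

print(total_size(sample, "crawl-1", "image/gif"))  # prints 2000
```

On a cluster, the map step would run in parallel over splits of the input files and the framework would group the emitted pairs by key before the reduce step; the per-key summation itself stays the same.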
Automatic measures In a similar "traditional setup" (a single quad-core server with sequential processing) we measured a throughput of around 26,500 lines/second.
  • 100,000 lines/second overall throughput with 5 nodes
  • Performance can be increased by adding further nodes
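To put the two throughput figures in relation, the short calculation below derives the speedup over the single-server baseline and the average throughput per node. It uses only the numbers quoted above; the one-billion-line workload is a hypothetical value chosen for illustration, and since this page does not say whether a cluster node matches the baseline server's hardware, the results are indicative at best.

```python
BASELINE_LINES_PER_SEC = 26_500   # single quad-core server, sequential processing
CLUSTER_LINES_PER_SEC = 100_000   # overall throughput quoted for 5 nodes
NODES = 5


def speedup(cluster_rate: float, baseline_rate: float) -> float:
    """How many times faster the cluster rate is than the baseline rate."""
    return cluster_rate / baseline_rate


def per_node_rate(cluster_rate: float, nodes: int) -> float:
    """Average throughput contributed by each node."""
    return cluster_rate / nodes


def processing_hours(total_lines: int, lines_per_sec: float) -> float:
    """Hours needed to process total_lines at a constant rate."""
    return total_lines / lines_per_sec / 3600


# ASSUMPTION: a hypothetical workload of one billion lines.
HYPOTHETICAL_LINES = 1_000_000_000

print(round(speedup(CLUSTER_LINES_PER_SEC, BASELINE_LINES_PER_SEC), 2))        # 3.77
print(per_node_rate(CLUSTER_LINES_PER_SEC, NODES))                             # 20000.0
print(round(processing_hours(HYPOTHETICAL_LINES, BASELINE_LINES_PER_SEC), 2))  # 10.48
print(round(processing_hours(HYPOTHETICAL_LINES, CLUSTER_LINES_PER_SEC), 2))   # 2.78
```

Under these figures, 5 nodes deliver a bit less than four times the baseline throughput, i.e. about 20,000 lines/second per node on average.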
Manual assessment Processing time can be influenced by adding worker nodes to the cluster or removing them from it.
Actual evaluations Links to actual evaluations of this issue/scenario
Evaluation URL
Labels issue, hadoop, webarchive, characterisation, unknown_characteristics