Collection:

Title
Austrian National Library - Web Archive
Description The Austrian National Library provides a representative dataset from its web archive:
- event-based selective crawls: sites harvested frequently during an event, e.g. the EU election 2009 or the Olympic Games 2010,
- the domain crawl 2009, covering about 1 million domains.

The web archive data is available in the ARC.GZ format.
The size of the ARC.GZ data set is 1377 GB.

The metadata log file produced during the crawl process is available as a text (TXT) file and has a size of 197 GB.
Licensing Sample only available to SCAPE partners.
Owner Austrian National Library (ONB)
Collection expert Prändl-Zika Veronika (ONB)
Issues brainstorm  
List of Issues IS25 Web Content Characterisation
IS41 Analyse huge text files containing information about a web archive

Issue:

Title
IS41 Analyse huge text files containing information about a web archive
Detailed description Some web archives produce information about their content on a periodical basis. The results are sometimes stored as huge text files. Processing these text files is very time consuming, and it is difficult to extract statistics from them or to perform analyses on them.
Scalability Challenge
Depending on the size of the text files and the type of processing, a scalability challenge may arise.
Issue champion Markus Raditsch (ONB)
Other interested parties
 
Possible Solution approaches Hadoop is designed for analysing huge text files; using the MapReduce programming model should therefore be considered.
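Below is a minimal sketch of such a MapReduce job, written against the standard Hadoop MapReduce API (Java). The tab-separated log layout and the column positions (crawlID in column 0, MIME type in column 3, object size in bytes in column 4) are assumptions made purely for illustration and would have to be adapted to the actual crawl-log format. The job sums the object size per (crawlID, MIME type) pair, which is the kind of statistic described in the success criteria below (e.g. the total size of all "image/gif" objects for a certain crawlID).

```java
// Sketch only: sums object sizes per (crawlID, MIME type) from a tab-separated crawl log.
// The column positions used here are assumptions for illustration.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlLogSize {

    public static class SizeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 5) {
                return; // skip malformed lines
            }
            String crawlId = fields[0];   // assumed position of the crawlID
            String mimeType = fields[3];  // assumed position of the MIME type
            try {
                long size = Long.parseLong(fields[4]); // assumed position of the object size in bytes
                // key: "crawlID<TAB>mimeType", value: object size
                context.write(new Text(crawlId + "\t" + mimeType), new LongWritable(size));
            } catch (NumberFormatException e) {
                // skip lines with a non-numeric size field
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawl log size per MIME type");
        job.setJarByClass(CrawlLogSize.class);
        job.setMapperClass(SizeMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation to reduce shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The aggregated part files produced by the reducer could then be filtered for a particular crawlID and MIME type. Because map tasks process input splits independently, throughput should grow roughly with the number of worker nodes, which is what the evaluation below measures.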
Context The web archive at the Austrian National Library produces huge text files as the result of the web harvesting process.
Lessons Learned
Training Needs Hadoop training.
Datasets Austrian National Library - Web Archive
Solutions SO27 Analyse huge text files containing information about a web archive using Hadoop

Evaluation

Objectives Automation, scalability
Success criteria The technical implementation will make it possible to analyse statistical data stored in large text files and to provide an appropriate output (e.g. the total size of all "image/gif" objects related to a certain crawlID). The input text files will be provided by a web crawler system in a specific format.
Automatic measures In a comparable "traditional" setup (a single quad-core server processing the files sequentially) we measured a throughput of around 26,500 lines/second.
Goal:
  • 100,000 lines/sec overall throughput with 5 nodes
  • Performance can be increased by adding additional nodes
Manual assessment Processing time can be influenced by adding worker nodes to, or removing them from, the cluster.
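As a rough plausibility check on the figures above (a sketch only, assuming near-linear scaling of overall throughput with the number of worker nodes, i.e. ignoring coordination overhead):

```latex
% Assumed near-linear scaling of overall throughput T with node count n
T(n) \approx n \cdot T_{\text{node}}
\qquad\Rightarrow\qquad
T_{\text{node}} \approx \frac{100{,}000\ \text{lines/s}}{5} = 20{,}000\ \text{lines/s per node}
```

Under this assumption each node would need to sustain about 20,000 lines/sec, which is below the 26,500 lines/sec measured on the single sequential server, so the 5-node target leaves some headroom for distribution overhead.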
Actual evaluations Links to actual evaluations of this Issue/Scenario
Evaluation URL
EVAL-WCT8-1

Solutions:

Title SO27 Analyse huge text files containing information about a web archive using Hadoop
Detailed description Analyse huge text files containing information about a web archive using Hadoop
Solution Champion
Markus Raditsch (ONB)
Corresponding Issue(s)
myExperiment Link

Tool Registry Link

Evaluation

Labels:
scenario, hadoop, characterisation, webarchive
Comments:
Oct 11, 2012: Is this a preservation action?!