
Collection:

Title
Austrian National Library - Web Archive
Description The Austrian National Library uses representative datasets from its web archive:
- selective event crawls: sites harvested frequently during an event, e.g. EU election 2009, Olympic Games 2010
- domain crawls from 2009 covering about 1 million domains.

The web archive data is available in the ARC.GZ format.
The size of the ARC.GZ data set is 1377GB.

The metadata log file produced during the crawl process is available as a txt file and has a size of 197GB.
Licensing Sample only available to SCAPE partners.
Owner Austrian National Library (ONB)
Collection expert Prändl-Zika Veronika (ONB)
Issues brainstorm  
List of Issues IS25 Web Content Characterisation
IS41 Analyse huge text files containing information about a web archive

Issue:

Title
IS25 Web Content Characterisation
Detailed description
The issue with web content is mainly that web archive data is very heterogeneous. Depending on the policy of the institution, the data contains text documents in all kinds of text encodings, HTML content loosely following different HTML specifications, audio and video files encoded with a variety of codecs, etc. But in order to take any preservation decisions, it is indispensable to have detailed information about the content of the web archive, especially the pieces of information that preservation tools depend on.
It is not possible to perform a data migration without knowing exactly what kind of digital object is encountered in the collection and what the logical and technical dependencies of the object are. It is necessary to identify not only the single objects contained in an ARC/WARC file, but also container formats such as packaged files. Video files, for example, are often available in so-called wrapper formats like AVI, where the audio and video streams can each be encoded using different codecs. The content must be identified down to this stream level if the institutional policy foresees preserving all video and audio content contained in a web archive.
Furthermore, the issue has two different aspects. The first is the challenge of identifying content that is already known; here the main goal is to identify the content correctly. The second aspect is unknown content in the web archive, which is measured by the coverage of identification tools, where coverage indicates the part of the content that can be identified. Coverage depends on reliability in the sense that poor reliability can hide poor coverage when many objects are incorrectly identified but are actually unknown. The challenge regarding this second aspect is to arrive at a precise set of the unknown objects in order to derive a plan dealing exactly with these objects.
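The interplay between coverage and reliability can be illustrated with a small sketch. It assumes a hand-checked sample in which the true format of each object is known (or known to be unknown); the function name and the three reported figures are illustrative choices, not measures defined by this scenario.

```python
from typing import Dict, Optional, Sequence


def coverage_and_correctness(
    truth: Sequence[Optional[str]],
    reported: Sequence[Optional[str]],
) -> Dict[str, float]:
    """Compare tool output against a hand-checked sample.

    truth[i]    -- real format of object i, or None if it is genuinely unknown
    reported[i] -- format the tool reported, or None if it gave no answer
    """
    if len(truth) != len(reported):
        raise ValueError("truth and reported must have the same length")
    total = len(truth)
    if total == 0:
        raise ValueError("sample must not be empty")

    # Objects for which the tool returned any answer at all.
    identified = sum(1 for r in reported if r is not None)
    # Objects for which that answer matches the real format.
    correct = sum(
        1 for t, r in zip(truth, reported) if r is not None and r == t
    )
    # Unknown objects that the tool nevertheless labelled: these inflate
    # the apparent coverage and hide the true set of unknown objects.
    hidden_unknown = sum(
        1 for t, r in zip(truth, reported) if t is None and r is not None
    )
    return {
        "apparent_coverage": identified / total,
        "correctness": correct / identified if identified else 0.0,
        "hidden_unknown": float(hidden_unknown),
    }
```

In a sample where an unknown object is wrongly reported as text/plain, the apparent coverage rises while the object disappears from the set of unknowns, which is the effect described above.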
From a practical point of view, the challenge starts with the ARC/WARC file format that ONB and SB, as the main stakeholders of this issue, are using in their web archives. The Heritrix web crawler (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix) produces these files as a result of the web crawls. The business logic and implementation are accessible, since Heritrix is available as a collaborative code project on GitHub: https://github.com/internetarchive/heritrix3, but this logic has been integrated into the web crawler, not into web content preservation workflows. This leads to the subordinate issue of dealing with ARC/WARC files as the basis of web content preservation workflows.
The last aspect of this issue is that several tools are known to address these kinds of challenges in general; still, the integration of the tools provided by work package PC.WP.1 must be ensured by embedding them in real-life workflows.
Scalability Challenge
Billions of objects, hundreds of TBytes
Issue champion Bjarne Andersen (SB)
Markus Raditsch (ONB)
Other interested parties
 
Possible Solution approaches 1. Make Taverna workflows work with ARC/WARC containers
2. Test / expand format coverage of different existing tools
Context
Lessons Learned  
Training Needs
Datasets State and University Library Denmark - Web Archive Data
Austrian National Library - Web Archive
Solutions SO07 Develop Warc Unpacker
SO11 The Tika characterisation Tool

Evaluation

Objectives This is about automation and scalability due to the vast amounts of data. Currently there are over 7 billion objects in Netarchive.dk.
Success criteria We will have a workflow that can characterize the content of a web archive within a reasonable time frame and with reasonable correctness
Automatic measures 1. Process 50,000 objects per node per hour
2. Identify 95% of the objects correctly
Manual assessment  
Actual evaluations links to actual evaluations of this Issue/Scenario
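
To put the throughput measure in perspective, the following sketch computes the processing effort it implies. The object count (7 billion) and the rate (50,000 objects per node per hour) come from the objectives and automatic measures above; the number of nodes is a free parameter chosen purely for illustration.

```python
def node_hours(total_objects: int, objects_per_node_hour: int) -> float:
    """Total compute effort, in node-hours, at a fixed per-node rate."""
    if objects_per_node_hour <= 0:
        raise ValueError("objects_per_node_hour must be positive")
    return total_objects / objects_per_node_hour


def wall_clock_days(
    total_objects: int, objects_per_node_hour: int, nodes: int
) -> float:
    """Elapsed days when the work is spread evenly over `nodes` nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_hours(total_objects, objects_per_node_hour) / nodes / 24


if __name__ == "__main__":
    # 7 billion objects at 50,000 objects per node per hour.
    print(node_hours(7_000_000_000, 50_000))
    # Hypothetical cluster size of 100 nodes (an assumption, not a source figure).
    print(wall_clock_days(7_000_000_000, 50_000, nodes=100))
```

At the target rate, 7 billion objects correspond to 140,000 node-hours, so a single node would need roughly 16 years and the work only becomes feasible when spread over many nodes.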

Solutions:

Title SO17 Web Archive Mime-Type detection workflow based on Droid and Apache Tika
Detailed description An experimental workflow has been implemented using Taverna Workbench. Due to the large amount of local data to be processed, the workflow uses locally running tools (instead of web services). The workflow input port expects a text file containing a list of file paths to ARC.GZ files. While the workflow is executed, the GZ and ARC files are unpacked and analyzed in parallel. The result of the workflow is a summary report of the Mime-Type distribution inside all the ARC.GZ files.

Input of the workflow:
- A flat text file containing a list of ARC.GZ files to be analyzed.

Output of the workflow:
- One report file per ARC.GZ file, containing all Mime-Types plus a count of each in CSV format.
- One report file over all processed ARC.GZ files containing a normalized Mime-Type distribution list in XLS format.

The workflow exists in two versions:
- One using the TIFOWA tool (by ONB) utilizing the Apache TIKA 0.7 API.
- One using DROID 6.0.1 in command line mode.


Rough workflow walkthrough:
1. Read input list of ARC.GZ files
2. Unpack each GZ to ARC
3. Tool: unpack each ARC to a temporary folder (flagged with a “taskID”)
4. Tool: run the characterization (TIKA or DROID)
5. Clean up the temporary files (per iteration)
6. Tool: wait for all iterations to be completed and generate a summary report over all partial reports

The summary output makes it easy to compare the results produced by the two different characterization core tools – if running on the same test set.
All steps are running in parallel (e.g. ARC10 characterization is running while ARC30 is unpacking while ARC01 has already been deleted from the temporary file system) – except the creation of the summary report.
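
The walkthrough above can be sketched outside Taverna as a small Python pipeline. This is only an illustration of the control flow, not the actual workflow: the `unpack` and `characterize` callables are hypothetical stand-ins for unARC and for TIFOWA or DROID, and a thread pool takes the place of Taverna's parallel iterations.

```python
import gzip
import shutil
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

# Hypothetical stand-ins for the external tools used in the real workflow.
Unpacker = Callable[[Path, Path], None]    # (arc_file, target_dir)
Characterizer = Callable[[Path], Counter]  # target_dir -> Mime-Type counts


def process_one(
    arc_gz: Path, unpack: Unpacker, characterize: Characterizer
) -> Counter:
    """Unpack, characterize and clean up a single ARC.GZ file."""
    # One temporary folder per task, playing the role of the "taskID" folder.
    work_dir = Path(tempfile.mkdtemp(prefix="task_"))
    try:
        arc_file = work_dir / arc_gz.stem  # foo.arc.gz -> foo.arc
        with gzip.open(arc_gz, "rb") as src, open(arc_file, "wb") as dst:
            shutil.copyfileobj(src, dst)
        content_dir = work_dir / "content"
        content_dir.mkdir()
        unpack(arc_file, content_dir)
        return characterize(content_dir)
    finally:
        # Per-iteration cleanup, even if a tool fails.
        shutil.rmtree(work_dir, ignore_errors=True)


def run_workflow(
    list_file: Path,
    unpack: Unpacker,
    characterize: Characterizer,
    workers: int = 4,
) -> Counter:
    """Process every ARC.GZ listed in list_file and merge the partial reports."""
    paths = [
        Path(line.strip())
        for line in list_file.read_text().splitlines()
        if line.strip()
    ]
    summary: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Files are processed concurrently; the summary is built afterwards
        # from the partial results, mirroring the final non-parallel step.
        partials = pool.map(
            lambda p: process_one(p, unpack, characterize), paths
        )
        for partial in partials:
            summary.update(partial)
    return summary
```

Passing the tools in as callables keeps the sketch independent of any particular tool and makes it easy to run the same list of files through two different characterizers for comparison.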


Involved tools:

unARC:
A tool (by SB) to unpack ARC files.

TIFOWA (using the TIKA API):
TIFOWA (by ONB) uses the TIKA API to extract metadata from the files contained in a folder structure. It creates a list of all detected "Content-Type" tags with the total number of occurrences. It presents the output in a format which can easily be imported as CSV data for further processing in a spreadsheet program. Embedded in the described Taverna workflow, the result is a set of files containing the "Content-Type" distribution list for each ARC file.
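
The kind of report described here can be sketched as follows. Note the substitutions: this is not TIFOWA and does not call TIKA. Python's `mimetypes` module guesses from the file name only, whereas TIKA inspects file content, and the two-column CSV layout is an assumption rather than TIFOWA's actual output format.

```python
import csv
import io
import mimetypes
import os
from collections import Counter


def count_content_types(root: str) -> Counter:
    """Walk a folder structure and count one content type per file."""
    counts: Counter = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Extension-based guess; a stand-in for content-based detection.
            guessed, _encoding = mimetypes.guess_type(name)
            counts[guessed or "unknown"] += 1
    return counts


def to_csv(counts: Counter) -> str:
    """Render the distribution as CSV text (assumed layout: type, count)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["content_type", "count"])
    for content_type, n in sorted(counts.items()):
        writer.writerow([content_type, n])
    return buf.getvalue()
```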

DROID in command line mode:
In an external tool plugin in Taverna WB, the DROID jar is called twice. First to add the folder to be analyzed to a temporary DROID profile. Then to create the DROID CSV report for that profile. Afterwards we need to run a small tool (“csv2tifowa” by ONB) to pick the data we need from the DROID CSV and create an output similar to the output format we are using in TIFOWA (to be able to use the same tool for creating the summary report in the next step).
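
The conversion step could look roughly like the sketch below. It is not the csv2tifowa tool itself: it assumes that the DROID CSV export has a header row with `TYPE` and `MIME_TYPE` columns and that folder rows are marked with the type `Folder`; both column names are parameters so they can be adjusted to the export actually produced.

```python
import csv
from collections import Counter
from typing import TextIO


def droid_csv_to_counts(
    handle: TextIO,
    mime_column: str = "MIME_TYPE",  # assumed column name
    type_column: str = "TYPE",       # assumed column name
) -> Counter:
    """Reduce a DROID CSV export to a Mime-Type distribution."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        # Folder rows describe directories, not archived objects.
        if (row.get(type_column) or "").strip().lower() == "folder":
            continue
        mime = (row.get(mime_column) or "").strip()
        # Files without an identification are counted explicitly.
        counts[mime or "unknown"] += 1
    return counts
```

Counting unidentified files under an explicit `unknown` key keeps them visible in the summary, which matters for the coverage aspect of the issue.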

MergeTifowaReports:
A tool (by ONB) to normalize the characterization output (e.g. “UTF 8” => “utf8”) and to create one XLS summary report from all the partial reports created during workflow processing.
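
A minimal sketch of the normalize-and-merge idea is given below. The normalization rules are assumptions that go beyond the single “UTF 8” => “utf8” example: the value is lower-cased, only the charset parameter is kept, and spaces, hyphens and underscores are dropped from the charset name. Writing the XLS file is left out; the sketch returns the merged counts.

```python
import re
from collections import Counter
from typing import Iterable


def normalize(content_type: str) -> str:
    """Map spelling variants of a content type to one canonical form."""
    media, _sep, params = content_type.strip().lower().partition(";")
    media = media.strip()
    match = re.search(r'charset\s*=\s*"?([^";]+)', params)
    if not match:
        return media
    # "utf 8", "utf-8" and "utf_8" all become "utf8".
    charset = re.sub(r"[\s_-]+", "", match.group(1))
    return f"{media}; charset={charset}"


def merge_reports(reports: Iterable[Counter]) -> Counter:
    """Sum partial reports after normalizing their keys."""
    total: Counter = Counter()
    for report in reports:
        for content_type, n in report.items():
            total[normalize(content_type)] += n
    return total
```

Normalizing before summing ensures that variants such as `text/html; charset=UTF 8` and `text/html;charset=utf-8` end up in a single row of the summary.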



Solution Champion
Markus Raditsch (ONB)
Corresponding Issue(s)
IS25 Web Content Characterisation
myExperiment Link
Webarchive characterizer using Apache TIKA(TM) at myexperiment.org
Tool Registry Link

Evaluation

Labels:
scenario
webarchive
  1. Oct 11, 2012

    What's the difference between this scenario and WCT3?

    1. Oct 12, 2012

      MR

      The dataset (collection) is different.

      1. Oct 23, 2012

        Right! Thanks.