Title SO17 Web Archive Mime-Type detection workflow based on Droid and Apache Tika
Detailed description An experimental workflow has been implemented using Taverna Workbench. Because of the large amount of local data to be processed, the workflow uses locally running tools instead of web services. The workflow input port expects a text file containing a list of file paths to ARC.GZ files. During execution, the GZ and ARC files are unpacked and analyzed in parallel. The result of the workflow is a summary report of the Mime-Type distribution across all the ARC.GZ files.

Input of the workflow:
1) A flat text file containing a list of ARC.GZ files to be analyzed.

Output of the workflow:
1) One report file per ARC.GZ file, listing every detected Mime-Type with its number of occurrences, in CSV format.
2) One report file covering all processed ARC.GZ files, containing a normalized Mime-Type distribution list in XLS format.

The workflow exists in two versions:
1) One using the TIFOWA tool (by ONB), which uses the Apache TIKA 0.7 API.
2) One using DROID 6.0.1 in command line mode.

Rough workflow walkthrough:
1) Read the input list of ARC.GZ files
2) Unpack each GZ to ARC
3) Tool: unpack each ARC to a temporary folder (flagged with a “taskID”)
4) Tool: run the characterization (TIKA or DROID)
5) Clean up the temporary files (per iteration)
6) Tool: wait for all iterations to complete and generate a summary report over all partial reports

The summary output makes it easy to compare the results produced by the two characterization core tools, provided both are run on the same test set.
All steps except the creation of the summary report run in parallel (e.g. ARC10 is being characterized while ARC30 is being unpacked and ARC01 has already been deleted from the temporary file system).
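
The walkthrough above can be sketched outside Taverna as a small Python script. This is only an illustration of the control flow, not the actual workflow: the function names, the worker count, and the `characterize` callable (a stand-in for the ARC unpacker plus TIKA or DROID) are hypothetical.

```python
import gzip
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_input_list(list_file):
    """Return the non-empty lines of the input list as paths."""
    with open(list_file, encoding="utf-8") as handle:
        return [Path(line.strip()) for line in handle if line.strip()]


def gunzip(gz_path, target_dir):
    """Unpack one .gz file into target_dir and return the unpacked path."""
    out_path = Path(target_dir) / Path(gz_path).with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def process_one(task_id, gz_path, characterize):
    """Unpack, characterize and clean up a single ARC.GZ file."""
    # Each iteration gets its own temporary folder, tagged with the task ID.
    work_dir = Path(tempfile.mkdtemp(prefix=f"task{task_id}_"))
    try:
        arc_path = gunzip(gz_path, work_dir)
        # characterize is a hypothetical stand-in for the external tools.
        return characterize(arc_path)
    finally:
        # Per-iteration cleanup of the temporary files.
        shutil.rmtree(work_dir, ignore_errors=True)


def run_all(list_file, characterize, workers=4):
    """Process all listed files in parallel and collect the partial results."""
    paths = read_input_list(list_file)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_one, task_id, path, characterize)
            for task_id, path in enumerate(paths)
        ]
        # Only this step waits for every iteration, like the summary report.
        return [future.result() for future in futures]
```

The partial results are returned in input order, so a summary step can run once `run_all` has returned.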

Involved tools:

A tool (by SB) to unpack ARC files.

TIFOWA (using the TIKA API):
TIFOWA (by ONB) uses the TIKA API to extract metadata from the files contained in a folder structure. It creates a list of all detected "Content-Type" tags with the total number of occurrences of each, and presents the output in a format that can easily be imported as CSV data for further processing in a spreadsheet program. Embedded in the described Taverna workflow, the result is a set of files containing the "Content-Type" distribution list for each ARC file.
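
The counting step can be illustrated with a minimal Python sketch. It is not TIFOWA itself: detection here defaults to the standard library's extension-based `mimetypes` guess rather than TIKA's content-based detection, and the two-column CSV layout (content type, count) is an assumption about the output format.

```python
import csv
import io
import mimetypes
import os
from collections import Counter


def guess_by_extension(path):
    """Stand-in detector: guesses from the file extension only (not TIKA)."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def count_content_types(folder, detect=guess_by_extension):
    """Walk a folder structure and count the detected content types."""
    counts = Counter()
    for root, _dirs, files in os.walk(folder):
        for name in files:
            counts[detect(os.path.join(root, name))] += 1
    return counts


def to_csv(counts):
    """Render the distribution as CSV text, one 'type,count' row per type."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for content_type, count in sorted(counts.items()):
        writer.writerow([content_type, count])
    return buffer.getvalue()
```

A different detector can be passed in through the `detect` parameter without changing the counting or the CSV output.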

DROID in command line mode:
In an external tool plugin in Taverna WB, the DROID jar is called twice: first to add the folder to be analyzed to a temporary DROID profile, then to create the DROID CSV report for that profile. Afterwards a small tool (“csv2tifowa” by ONB) picks the required data from the DROID CSV and writes output similar to the TIFOWA output format, so that the same tool can create the summary report in the next step.
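
The conversion step can be sketched as follows. This is not the csv2tifowa source: the column names `MIME_TYPE` and `TYPE`, the `Folder` value used to skip directory rows, and the `unknown` label for rows without a Mime-Type are assumptions that should be checked against the CSV actually produced by the DROID version in use.

```python
import csv
from collections import Counter


def mime_counts_from_droid_csv(handle, column="MIME_TYPE"):
    """Count Mime-Types in a DROID CSV export read from an open text handle."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Assumption: directory entries are marked with TYPE == "Folder".
        if row.get("TYPE") == "Folder":
            continue
        # Assumption: an empty cell means DROID reported no Mime-Type.
        mime = (row.get(column) or "").strip() or "unknown"
        counts[mime] += 1
    return counts
```

The resulting counts have the same shape as a TIFOWA-style distribution list, which is what allows one summary tool to serve both workflow versions.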

A tool (by ONB) to normalize the characterization output (e.g. “UTF 8” => “utf8”) and to create one XLS summary report from all the partial reports created during workflow processing.
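
A minimal sketch of the normalize-and-merge idea is shown below. The normalization rule (lowercase and remove all whitespace) is an assumption chosen only because it reproduces the “UTF 8” => “utf8” example; the rules of the actual tool are not documented here, and writing the XLS file is left out.

```python
from collections import Counter


def normalize(label):
    """Assumed rule: lowercase the label and drop all whitespace."""
    return "".join(label.split()).lower()


def merge_reports(reports):
    """Sum the counts of several partial reports under normalized labels."""
    total = Counter()
    for report in reports:
        for label, count in report.items():
            total[normalize(label)] += count
    return total
```

Each partial report is a mapping from label to count, so variants such as `text/html; charset=UTF 8` and `text/html;charset=utf8` end up in the same row of the summary.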

Solution Champion
Markus Raditsch (ONB)
Corresponding Issue(s)
IS25 Web Content Characterisation
myExperiment Link
Webarchive characterizer using Apache TIKA(TM) at
Tool Registry Link

