Status

Active

Contact

Ivan Vujic - ivujic@microsoft.com

User story

As part of evaluating which platform to use for running characterization and identification tools, a user should be able to assess the SCAPE Azure platform. We measured the speed of the Apache Tika Content Analysis Toolkit and the DROID File Format Identification Tool when run on a Microsoft Azure virtual machine. The results are compared with the speed of the same tools running on a traditional on-site server.

Datasets

The documents used were obtained from the Govdocs1 digital corpora at http://digitalcorpora.org/corpora/files. The ten files called “subset0.zip” through “subset9.zip” were downloaded to the virtual machine and unzipped, and the extensions were removed from all the file names: “3865.pdf” became “3865”, for example.  This was consistent with the earlier evaluations, which removed the extensions to test Tika and DROID’s ability to identify formats without using file extension analysis.

There were 11,919 such documents.
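
The extension-removal step described above can be sketched in Python as follows. This is only an illustration of the renaming rule ("3865.pdf" becomes "3865"), not the script used in the evaluation. The function name `strip_extensions` and the choice to skip a rename when the target name already exists are assumptions made for this example.

```python
from pathlib import Path


def strip_extensions(root):
    """Rename every file under root so that its final extension is dropped.

    Returns the list of new paths. Files without an extension are left
    untouched, and a rename is skipped if the target name already exists
    (an assumption made here to avoid overwriting data).
    """
    renamed = []
    # Build the full listing first so renames do not disturb the traversal.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or not path.suffix:
            continue
        target = path.with_suffix("")  # "3865.pdf" -> "3865"
        if target.exists():
            continue  # do not overwrite an existing file or directory
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `strip_extensions` on the directory holding the unzipped subsets would rename the files in place. Comparing the length of the returned list with the expected document count is a quick sanity check.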

Experiments

Tika and DROID running on SCAPE Azure Platform.

Approach, results and reasoning can be found on this page: Characterisation and Identification on SCAPE Azure Platform

Developer notes

This story provides the opportunity to compare tools such as Tika and DROID across platforms, building on the evaluations performed by PC.CC. (See also Incubator)
