Ivan Vujic - [email protected]
As part of evaluating which platform users should use to run characterization and identification tools, we tested the SCAPE Azure platform. We measured the speed of the Apache Tika Content Analysis Toolkit and the DROID File Format Identification Tool when they were run on a Microsoft Azure virtual machine. The results are compared to the speed of the same tools running on a traditional on-site server.
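A speed measurement of this kind can be sketched as a small timing wrapper. This is only an illustrative sketch, not the harness used in the evaluation: the function name `time_command` is hypothetical, and the exact Tika or DROID command line is not given on this page, so the command is supplied by the caller.

```python
import subprocess
import time
from typing import Sequence


def time_command(cmd: Sequence[str], file_count: int) -> tuple[float, float]:
    """Run cmd once and return (elapsed seconds, files processed per second).

    cmd is whatever command line launches the tool under test; it is an
    assumption of this sketch and is not taken from the evaluation itself.
    """
    start = time.perf_counter()
    # check=True raises CalledProcessError if the tool exits with an error,
    # so a failed run is never reported as a valid timing.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return elapsed, file_count / elapsed
```

Running the same wrapper with the same command on the Azure virtual machine and on the on-site server gives two directly comparable files-per-second figures.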
The documents used were obtained from the Govdocs1 digital corpora at http://digitalcorpora.org/corpora/files. The ten files called “subset0.zip” through “subset9.zip” were downloaded to the virtual machine and unzipped, and the extensions were removed from all the file names: “3865.pdf” became “3865”, for example. This was consistent with the earlier evaluations, which removed the extensions to test Tika and DROID’s ability to identify formats without using file extension analysis.
There were 11,919 such documents.
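The extension-removal step described above could be reproduced along the following lines. This is a minimal sketch under stated assumptions: the function name `strip_extensions` and the directory name in the usage comment are hypothetical, and the page does not say how the renaming was actually done.

```python
from pathlib import Path


def strip_extensions(root: Path) -> int:
    """Rename every file under root so that its final extension is removed.

    Returns the number of files renamed. Files without an extension are left
    alone, and a file is skipped if the extension-free name already exists.
    """
    renamed = 0
    # sorted() materialises the listing first, so renaming does not disturb
    # the directory walk.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or not path.suffix:
            continue
        target = path.with_suffix("")  # "3865.pdf" becomes "3865"
        if target.exists():
            continue  # avoid overwriting another file
        path.rename(target)
        renamed += 1
    return renamed


# Hypothetical usage, assuming the subsets were unzipped into ./govdocs:
# count = strip_extensions(Path("govdocs"))
```

Only the last suffix is removed, which is sufficient for names of the form "3865.pdf".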
Tika and DROID running on SCAPE Azure Platform.
Approach, results and reasoning can be found on this page: Characterisation and Identification on SCAPE Azure Platform
This story provides the opportunity to compare tools such as Tika and DROID across platforms, as performed by PC.CC. (See also Incubator)