
h1. Introduction

This document describes a study that measured the speed of the Apache Tika Content Analysis Toolkit and the DROID File Format Identification Tool when they were run on a Microsoft Azure virtual machine.  The results are compared to the speed of the same tools running on a traditional on-site server.

h1. Background

The Apache Tika Content Analysis Toolkit (or “Tika,” for brevity) is Java-based software that identifies the format of a file and extracts metadata from it.  It’s available at [http://tika.apache.org/|http://tika.apache.org/].  The DROID File Format Identification Tool (or “DROID”) is Java-based software that performs batch identification of file formats.  It’s available at [http://www.nationalarchives.gov.uk/information-management/projects-and-work/droid.htm|http://www.nationalarchives.gov.uk/information-management/projects-and-work/droid.htm].  Tika and DROID are important tools in the SCAPE digital content preservation project.

In an earlier study, SCAPE researchers evaluated these tools and measured their performance (Radtisch, May, Blekinge and Møldrup-Dalum, 2012).  The ability of Tika and DROID to accurately identify file formats was determined by running them against a set of approximately one million files in the Govdocs1 corpus and comparing the results to a _ground truth_ provided by Forensic Innovations, Inc.  The speed of the tools was measured while running them on an on-site server.

This newer study measured the speed of these tools when they were run in a cloud environment, namely on a Microsoft Azure virtual machine (VM).  (No effort was made to rerun the accuracy tests, as there is no reason to believe that the accuracy of the same tools would vary on different servers.)  Tika and DROID are used today in an Azure-based SCAPE API Service implemented by Microsoft Research and might be used in future Azure-based projects, so the speed of the tools when running in the cloud is an important consideration.

h1. Procedure

The procedure this study used to measure the speed of the tools adhered reasonably closely to what was done in the earlier study.  The one concession to expediency was the use of a subset of about 12,000 randomly chosen documents instead of the entire corpus of one million.
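The exact workflow scripts used in the study are not reproduced here, but a minimal sketch of the kind of measurement involved, using Tika's Java facade API over a hypothetical directory of sample files, might look like this:

{code:java}
import org.apache.tika.Tika;

import java.io.File;
import java.io.IOException;

// Minimal sketch (not the SCAPE workflow itself): identify every file in a
// sample directory with Tika's detect() facade and report files per second.
public class TikaSpeedSketch {
    public static void main(String[] args) throws IOException {
        File sampleDir = new File(args[0]);               // e.g. a local Govdocs1 subset (hypothetical path)
        File[] files = sampleDir.listFiles(File::isFile);

        Tika tika = new Tika();
        long start = System.nanoTime();
        for (File f : files) {
            tika.detect(f);                               // returns the detected MIME type, e.g. "application/pdf"
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("%d files in %.1f s = %.1f files/second%n",
                files.length, seconds, files.length / seconds);
    }
}
{code}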

h1. Results

The speed at which Tika and DROID identified file formats is shown in Table 1 below, along with the speed of an MD5 calculation utility operating on the same files.   Calculating a file’s MD5 checksum involves reading the entire file while performing very few calculations, so the MD5 numbers effectively compare file access times between the two servers.
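For reference, a simple MD5 calculation of this kind in Java (a sketch, not the actual utility used in the study) streams every byte of the file through a MessageDigest while doing almost no other work:

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Sketch of an MD5 checksum over a single file: every byte is read,
// but the per-byte work is trivial, so the runtime is dominated by I/O.
public class Md5Sketch {
    static String md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);        // cheap update for each block read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Of(Paths.get(args[0])));
    }
}
{code}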

The results are in files per second, so larger numbers are better.

(The on-site server numbers were read from a bar chart in the earlier report and are approximate.)
|| || On-Site Server || Azure VM ||
| *Tika* | 61 | 659 |
| *DROID* | 47 | 65 |
| *MD5* | 42 | 426 |
Table 1: Files per second

h1. Evaluation points


h5. Assessment of measurable points: Tika

|| Metric || Description || Metric baseline || Metric goal || _February 04, 2014_ || _May 21, 2014_ || _evaluation date_ ||
| NumberOfObjectsPerHour | Number of objects processed in one hour | | | 2371809 | 2447712 | |
| MinObjectSizeHandledInGbytes | Smallest file in the sample | | | 0.000000026 | 0.000000026 | |
| MaxObjectSizeHandledInGbytes | Largest file in the sample | | | 0.347421616 | 0.347421616 | |
| ThroughputGbytesPerMinute | The throughput of data measured in Gbytes per minute | | | 20.0 | 20.7 | |
| ThroughputGbytesPerHour | The throughput of data measured in Gbytes per hour | | | 1201.8 | 1240.2 | |
| ReliableAndStableAssessment | Manual assessment of whether the experiment ran reliably and stably | | | true | true | |
| NumberOfFailedFiles | Number of files that failed in the workflow | | | N/A | 0 | |
| AverageRuntimePerItemInSeconds | The average processing time in seconds per item | | | 0.001517829 | 0.001470761 | |

h5. Assessment of measurable points: DROID

|| Metric || Description || Metric baseline || Metric goal || _February 04, 2014_ || _May 21, 2014_ || _evaluation date_ ||
| NumberOfObjectsPerHour | Number of objects processed in one hour | | | 234472 | 217809 | |
| MinObjectSizeHandledInGbytes | Smallest file in the sample | | | 0.000000026 | 0.000000026 | |
| MaxObjectSizeHandledInGbytes | Largest file in the sample | | | 0.347421616 | 0.347421616 | |
| ThroughputGbytesPerMinute | The throughput of data measured in Gbytes per minute | | | 2.0 | 1.8 | |
| ThroughputGbytesPerHour | The throughput of data measured in Gbytes per hour | | | 118.8 | 110.4 | |
| ReliableAndStableAssessment | Manual assessment of whether the experiment ran reliably and stably | | | true | true | |
| NumberOfFailedFiles | Number of files that failed in the workflow | | | N/A | 1727 | |
| AverageRuntimePerItemInSeconds | The average processing time in seconds per item | | | 0.015353637 | 0.016528232 | |


h5. Assessment of measurable points: MD5

|| Metric || Description || Metric baseline || Metric goal || _February 04, 2014_ || _May 21, 2014_ || _evaluation date_ ||
| NumberOfObjectsPerHour | Number of objects processed in one hour | | | 1532443 | 1340888 | |
| MinObjectSizeHandledInGbytes | Smallest file in the sample | | | 0.000000026 | 0.000000026 | |
| MaxObjectSizeHandledInGbytes | Largest file in the sample | | | 0.347421616 | 0.347421616 | |
| ThroughputGbytesPerMinute | The throughput of data measured in Gbytes per minute | | | 12.9 | 11.3 | |
| ThroughputGbytesPerHour | The throughput of data measured in Gbytes per hour | | | 776.5 | 679.4 | |
| ReliableAndStableAssessment | Manual assessment of whether the experiment ran reliably and stably | | | true | true | |
| NumberOfFailedFiles | Number of files that failed in the workflow | | | N/A | 0 | |
| AverageRuntimePerItemInSeconds | The average processing time in seconds per item | | | 0.00234919 | 0.002684789 | |
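As a cross-check on the tables above, the derived metrics relate to one another in a straightforward way: NumberOfObjectsPerHour is approximately 3600 divided by AverageRuntimePerItemInSeconds, and ThroughputGbytesPerHour is approximately sixty times ThroughputGbytesPerMinute. A small sketch of that arithmetic, using the Tika figures for May 21, 2014:

{code:java}
// Arithmetic relating the metrics in the tables above, using the Tika
// figures for May 21, 2014 (values copied from the table) as an example.
public class MetricCheck {
    public static void main(String[] args) {
        double avgSecondsPerItem = 0.001470761;   // AverageRuntimePerItemInSeconds
        double gbytesPerMinute   = 20.7;          // ThroughputGbytesPerMinute

        System.out.printf("Objects per hour ~ %.0f (table: 2447712)%n",
                3600.0 / avgSecondsPerItem);      // ~2,447,709
        System.out.printf("Gbytes per hour  ~ %.1f (table: 1240.2)%n",
                60.0 * gbytesPerMinute);          // 1242.0
    }
}
{code}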


h1. Interpretation

The huge difference in the results for the MD5 utility (more than an order of magnitude) indicates that a much faster file system was used on the Azure VM.  Indeed, the Govdocs1 files were placed directly on the VM’s hard drive, whereas the files in the on-site study were hosted on a separate Network File System (NFS) server and accessed over the network.  If the files in the earlier study had been placed directly on the server (which probably wasn’t possible), the on-site performance would likely have been much better.

(It should also be noted that if one of Azure’s other storage options had been used, such as Blob storage or a SQL Server database, the Azure performance would likely have been much worse.)

This suggests that if a very large collection of documents needs to be identified quickly, the location of the files is important.

It’s likely that the difference in Tika performance is also due primarily to storage differences.  However, the DROID numbers seem odd, at least at first glance: DROID ran faster on the Azure VM, but certainly not an order of magnitude faster.  A possible explanation (further investigation would be required to confirm it) is that MD5 and Tika are I/O-bound, while DROID is CPU-bound.  In other words, DROID spends more time calculating than reading, so much so that even a 10X increase in reading speed results in only a moderate increase in overall speed.

Note that different CPUs were used in the two servers: The on-site server used dual Xeon X5670 CPUs running at 2.93 GHz, while the Azure VM used two cores of a six-core AMD Opteron 4171 HE running at 2.09 GHz.  It’s not clear which should be faster, but in any case the storage differences in the two studies clearly had a much larger effect than any CPU speed differences.

h1. Conclusions

The results can be easily summarized: Tika and DROID run fast on a Microsoft Azure VM, provided that the files that need to be identified are local to the machine.

A few additional facts became apparent during the study.  First, Tika and DROID can be used easily in the Microsoft Azure environment.  They were run on an Azure VM for this study, and they are currently used in an Azure Web Role in the SCAPE API Service.  (VMs are part of Azure’s infrastructure-as-a-service offering, while Web Roles and Worker Roles are part of Azure’s platform-as-a-service offering.)  In both cases, no serious problems were encountered while deploying the tools.

Second, the scalability inherent in Azure makes it easy to allocate hardware as needed.  If the VM’s hard disk isn’t large enough to hold the files that need to be identified, a larger disk can easily be attached.  Or if one VM is inadequate, duplicate VMs can be easily created.  Those are manual steps, but scaling can also be done dynamically in response to changing demands by utilizing Azure’s auto-scaling features.  This is all much easier than trying to scale up physical hardware in an on-site environment.