Version 2 by Ivan Vujic
on May 22, 2014 18:20.

compared with
Current by Rune Bruun Ferneke-Nielsen
on Jul 15, 2014 11:04.


h5. Assessment of measurable points Tika

|| Metric || Description || Metric baseline || Metric goal || _February 04, 2014_ || _May 21, 2014_ || _evaluation date_ ||
| NumberOfObjectsPerHour | Number of objects processed in one hour  \\ | | | 2371809 | 2447712 | |
| MinObjectSizeHandledInGbytes | Smallest ARC file in sample  \\ | | | 0.000000026 \\ | 0.000000026 \\ | |
| MaxObjectSizeHandledInGbytes | Biggest ARC file in sample  \\ | | | 0.347421616 | 0.347421616 \\ | |
| ThroughputGbytesPerMinute | The throughput of data measured in Gbytes per minute | | | 20.0 | 20.7 | |
| ThroughputGbytesPerHour | The throughput of data measured in Gbytes per hour | | | 1201.8 | 1240.2 | |
| AverageRuntimePerItemInSeconds | The average processing time in seconds per item | | | 0.001517829 | 0.001470761 | |

h5. Assessment of measurable points DROID

|| Metric || Description || Metric baseline || Metric goal || _February 04, 2014_ || _May 21, 2014_ || _evaluation date_ ||
| NumberOfObjectsPerHour | Number of objects processed in one hour  \\ | | | 234472 | 217809 | |
| MinObjectSizeHandledInGbytes | Smallest ARC file in sample  \\ | | | 0.000000026 \\ | 0.000000026 \\ | |
| MaxObjectSizeHandledInGbytes | Biggest ARC file in sample  \\ | | | 0.347421616 | 0.347421616 \\ | |
| ThroughputGbytesPerMinute | The throughput of data measured in Gbytes per minute | | | 2.0 | 1.8 | |
| ThroughputGbytesPerHour | The throughput of data measured in Gbytes per hour | | | 118.8 | 110.4 | |
| ReliableAndStableAssessment | Manual assessment of whether the experiment ran reliably and stably | | | true | true | |
| NumberOfFailedFiles | Number of files that failed in the workflow | | | N/A | 1727 | |
| AverageRuntimePerItemInSeconds | The average processing time in seconds per item | | | 0.015353637 | 0.016528232 | |


h5. Assessment of measurable points MD5

|| Metric || Description || Metric baseline || Metric goal || _February 04, 2014_ || _May 21, 2014_ || _evaluation date_ ||
| NumberOfObjectsPerHour | Number of objects processed in one hour  \\ | | | 1532443 | 1340888 | |
| MinObjectSizeHandledInGbytes | Smallest ARC file in sample  \\ | | | 0.000000026 \\ | 0.000000026 \\ | |
| MaxObjectSizeHandledInGbytes | Biggest ARC file in sample  \\ | | | 0.347421616 | 0.347421616 \\ | |
| ThroughputGbytesPerMinute | The throughput of data measured in Gbytes per minute | | | 12.9 | 11.3 | |
| ThroughputGbytesPerHour | The throughput of data measured in Gbytes per hour | | | 776.5 | 679.4 | |
| ReliableAndStableAssessment | Manual assessment of whether the experiment ran reliably and stably | | | true | true | |
| NumberOfFailedFiles | Number of files that failed in the workflow | | | N/A | 0 | |
| AverageRuntimePerItemInSeconds | The average processing time in seconds per item | | | 0.00234919 | 0.002684789 | |
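
Several metrics in the tables above appear to be derived from others rather than measured independently. The following is a minimal sketch of those relations, assuming that AverageRuntimePerItemInSeconds is 3600 divided by NumberOfObjectsPerHour and that ThroughputGbytesPerMinute is ThroughputGbytesPerHour divided by 60. Both relations are inferred from the numbers and are not stated on the page; the dictionary layout and the function name are illustrative only. The values are the May 21, 2014 figures.

```python
import math

# May 21, 2014 values copied from the three tables above.
runs = {
    "Tika": {"objects_per_hour": 2447712, "gbytes_per_hour": 1240.2,
             "avg_runtime_s": 0.001470761},
    "DROID": {"objects_per_hour": 217809, "gbytes_per_hour": 110.4,
              "avg_runtime_s": 0.016528232},
    "MD5": {"objects_per_hour": 1340888, "gbytes_per_hour": 679.4,
            "avg_runtime_s": 0.002684789},
}


def derived_metrics(run):
    """Recompute the derived metrics under the assumed relations."""
    return {
        # Assumption: one hour (3600 s) divided by the objects processed in it.
        "avg_runtime_s": 3600 / run["objects_per_hour"],
        # Assumption: hourly throughput spread evenly over 60 minutes.
        "gbytes_per_minute": run["gbytes_per_hour"] / 60,
    }


for tool, run in runs.items():
    metrics = derived_metrics(run)
    # The reported object counts are rounded, so compare with a tolerance.
    matches = math.isclose(metrics["avg_runtime_s"], run["avg_runtime_s"],
                           rel_tol=1e-5)
    print(f"{tool}: {metrics['avg_runtime_s']:.9f} s/item, "
          f"{metrics['gbytes_per_minute']:.1f} Gbytes/min, "
          f"matches reported runtime: {matches}")
```

Recomputing the derived values this way is a quick sanity check when a new evaluation date is added to a table.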


h1. Interpretation

The huge difference in the results for the MD5 utility (more than an order of magnitude) indicates that a much faster file system was used on the Azure VM.  Indeed, the Govdocs1 files were put directly on the VM’s hard drive, whereas the files in the on-site study were stored on a separate Network File System server and accessed over a network.  If the files in the earlier study had been put directly on the server (which probably wasn’t possible), the on-site performance would likely have been much better.

(It must also be noted that if one of the other storage options for Azure had been used, for example BLOB storage or a SQL Server database, the Azure performance would likely have been much worse.)

This suggests that if a very large collection of documents needs to be identified quickly, the location of the files is important.
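
One way to check the effect of file location is to run the same MD5 pass over one copy of the collection on local disk and another on the network mount, then compare the throughput. The sketch below is not the workflow used in these experiments; the function names are hypothetical, and it assumes that one Gbyte means 10^9 bytes, which the page does not specify.

```python
import hashlib
import time
from pathlib import Path


def md5_throughput(root, chunk_size=1 << 20):
    """Hash every file under root with MD5.

    Returns (number of files, Gbytes read, elapsed seconds).
    """
    files = 0
    total_bytes = 0
    start = time.perf_counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.md5()
        with path.open("rb") as handle:
            # Read in chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
                total_bytes += len(chunk)
        files += 1
    seconds = time.perf_counter() - start
    # Assumption: 1 Gbyte = 10**9 bytes.
    return files, total_bytes / 1e9, seconds


def to_gbytes_per_hour(gbytes, seconds):
    """Scale a measured run to the ThroughputGbytesPerHour unit."""
    return gbytes / seconds * 3600
```

Calling `md5_throughput` once on a local directory and once on the mounted share gives two directly comparable figures. Operating system file caches can flatter a second run over the same files, so each location should be measured from a cold start.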