This document describes a study that measured the speed of file conversions performed by the SCAPE API.
The SCAPE API is an Azure-based Web service created by Microsoft Research. The API, which is exposed via RESTful HTTP methods, can be used by programmers to perform various file-related tasks within custom applications. For example, you can use the API to do the following:
- Upload a file to SCAPE.
- Convert the uploaded file to another file format.
- Wait for the conversion to complete.
- Download the converted file.
The complete SCAPE API is documented in the “SCAPE API Guide” (Microsoft Research, 2014).
In this study, we used the SCAPE API to upload approximately 1,000 sample files to SCAPE, convert them to another format, and measure how long it took for all conversions to complete.
A custom test program was created to communicate with the SCAPE API. The sample files that were uploaded by the test program were obtained from the Govdocs1 digital corpora at http://digitalcorpora.org/corpora/files.
See this report’s appendix for a detailed description of the study procedure.
The following table shows the average conversion time per file for the five kinds of file conversions that were performed in this study.
Table 1: Average conversion times
Here are a few points to note about the results.
The last column in the table shows the number of conversions that were reported as “failed” by the SCAPE API. A SCAPE conversion can fail when a converter rejects an uploaded file or when a network error occurs. No attempt was made to determine the exact cause of the conversion failures in this study.
The failed conversions were included in the total and average calculations in the table.
The SCAPE API can perform many more conversions than were measured here (112, to be exact). However, SCAPE uses only two file converters for the conversions that are available via the API, and the set of conversions selected for this study made use of both converters. Measurements for the other available conversions would likely yield results of the same order of magnitude as the results shown here.
The average conversion times measured here reflect the behavior of the SCAPE servers when a large set of files is converted via the SCAPE API, all at the same time.
Because SCAPE’s file converters use a complex scheduling algorithm involving polling, batching and multithreading, the results might differ if only a few files are converted. For example, Word Automation Services polls for conversion requests only once per minute. If you convert just one DOC file to DOCX, the conversion (which is performed by Word Automation Services) could actually take longer than a minute, depending on where in the polling cycle your request was made. The effect of these polling delays diminishes as larger numbers of files are converted simultaneously.
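The polling effect can be illustrated with a deliberately simple model. The sketch below assumes a converter that picks up every pending request at fixed 60-second polling ticks, and it ignores the conversion time itself. The function names and the idea of sharing one pickup delay across a batch are illustrative assumptions, not a description of SCAPE's actual scheduler.

```python
import math

# Polling interval taken from the text: Word Automation Services polls once per minute.
POLL_INTERVAL_S = 60.0


def pickup_delay(arrival_s: float, poll_interval_s: float = POLL_INTERVAL_S) -> float:
    """Seconds a request waits until the next polling tick (assumed model)."""
    next_tick = math.ceil(arrival_s / poll_interval_s) * poll_interval_s
    return next_tick - arrival_s


def amortized_delay_per_file(arrival_s: float, n_files: int) -> float:
    """Pickup delay shared across n_files submitted together at arrival_s."""
    if n_files < 1:
        raise ValueError("n_files must be at least 1")
    return pickup_delay(arrival_s) / n_files
```

Under this model, a single request made one second after a polling tick waits 59 seconds before it is picked up, while the same wait spread over 1,000 files submitted together contributes only about 0.06 seconds per file.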
The results shown above could easily be improved by scaling out the servers in SCAPE’s backend system. During this study, the SCAPE backend used only one server hosting Word Automation Services and one server hosting the Open XML / ODF Translator. A big advantage of hosting SCAPE in Microsoft Azure is that Azure makes it easy to deploy additional servers and distribute API requests among them.
For example, ten servers could host ten instances of the Open XML / ODF Translator. Because these servers operate independently, we would expect the average conversion time for a DOCX→ODT conversion to be roughly one-tenth of the 2.58 seconds reported here, or about 0.26 seconds.
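The estimate above is a straightforward division, shown here as a small sketch. It assumes ideal linear scaling across fully independent servers; the function name is hypothetical, and real deployments would likely fall somewhat short of this figure.

```python
def scaled_average_time(single_server_avg_s: float, n_servers: int) -> float:
    """Estimated average time per file under ideal linear scale-out (assumption)."""
    if n_servers < 1:
        raise ValueError("n_servers must be at least 1")
    return single_server_avg_s / n_servers


# 2.58 seconds is the single-server average quoted in the text.
print(f"{scaled_average_time(2.58, 10):.2f}")  # prints 0.26
```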
Additional servers could also be used to create a “SharePoint server farm” to host multiple instances of Word Automation Services. (Word Automation Services is a part of SharePoint.) The servers in such a farm aren’t entirely independent and so the scaling behavior isn’t perfectly linear, but the results for the other conversions in this study (DOC→DOCX, DOC→RTF, DOC→PDF and DOCX→DOC) would still improve significantly if Word Automation Services were deployed on multiple servers.
The DOC files used for the first three conversions in this study (DOC→DOCX, DOC→RTF and DOC→PDF) were obtained from the Govdocs1 digital corpora at http://digitalcorpora.org/corpora/files. The ten files called “subset0.zip” through “subset9.zip” were downloaded to the test machine and unzipped, and all the DOC files in the unzipped subsets were used as uploaded files for the first three conversions. There were 1,080 such DOC files.
The final two conversions (DOCX→DOC, DOCX→ODT) required DOCX files, but there were only 352 DOCX files in the Govdocs1 digital corpora. To work around this, the DOCX files created in this study’s first conversion (DOC→DOCX) were used as the uploaded files for the final two conversions. There were 1,049 such DOCX files.
The test program was a custom C# console application that performed the following steps:
- Upload all files (or half of them; see below) using the “Upload File” method in the SCAPE API.
- Start a timer.
- For each uploaded file, convert the file to the other format using the “Convert Uploaded File” method in the SCAPE API.
- For each uploaded file, request the state of the converted file using the “Get Converted File List” method in the SCAPE API.
- Pause for 60 seconds, then repeat the previous step until every conversion has finished.
- Stop the timer.
- Report the results.
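The steps above can be sketched as follows. The actual test program was a C# console application; this Python outline only mirrors its control flow. The client methods (`upload_file`, `convert_uploaded_file`, `get_converted_file_state`), the state names, and the way the average is computed (elapsed time divided by the number of files, failures included) are all assumptions made for illustration, not the SCAPE API's real signatures. A fake in-memory client stands in for the service so the sketch can run on its own.

```python
import time
from typing import Callable, Dict, Iterable, List

# Terminal state names are hypothetical; the report only mentions "failed" conversions.
DONE_STATES = {"converted", "failed"}


def run_conversion_test(
    client,
    paths: Iterable[str],
    target_format: str,
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, float]:
    """Upload, convert, then poll until every conversion reaches a terminal state."""
    # Upload everything before timing starts.
    file_ids: List[str] = [client.upload_file(p) for p in paths]

    # Start the timer, then request one conversion per uploaded file.
    start = clock()
    for file_id in file_ids:
        client.convert_uploaded_file(file_id, target_format)

    # Check every file's state, pausing between rounds until all are done.
    while True:
        states = [client.get_converted_file_state(fid) for fid in file_ids]
        if all(state in DONE_STATES for state in states):
            break
        sleep(poll_interval_s)

    # Stop the timer and report; failed conversions count toward the average.
    elapsed = clock() - start
    return {
        "files": len(file_ids),
        "failed": states.count("failed"),
        "total_seconds": elapsed,
        "average_seconds": elapsed / len(file_ids) if file_ids else 0.0,
    }


class FakeClient:
    """In-memory stand-in for the service; finishes each file after N polls."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._polls: Dict[str, int] = {}
        self._polls_until_done = polls_until_done

    def upload_file(self, path: str) -> str:
        return "id:" + path

    def convert_uploaded_file(self, file_id: str, target_format: str) -> None:
        self._polls[file_id] = 0

    def get_converted_file_state(self, file_id: str) -> str:
        self._polls[file_id] += 1
        done = self._polls[file_id] >= self._polls_until_done
        return "converted" if done else "pending"


if __name__ == "__main__":
    # Skip the real 60-second pauses when running against the fake client.
    print(run_conversion_test(FakeClient(), ["a.doc", "b.doc"], "docx",
                              sleep=lambda seconds: None))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without waiting a full minute between polling rounds.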
For the DOCX→ODT conversion, which involves the Open XML / ODF Translator, a large number of conversion failures caused by interprocess communication timeouts occurred when the entire set of files was converted at once. Rather than increase the timeout value, the files were divided into two batches and the batches were converted separately, which prevented the timeouts. These timeout failures require further investigation.
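A minimal sketch of the batching workaround follows. The report does not say exactly how the files were divided, so the near-equal split and the function name here are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], n_batches: int = 2) -> List[List[T]]:
    """Split items into n_batches contiguous batches of near-equal size."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(items), n_batches)
    batches: List[List[T]] = []
    start = 0
    for index in range(n_batches):
        # The first `extra` batches each take one additional item.
        end = start + base + (1 if index < extra else 0)
        batches.append(list(items[start:end]))
        start = end
    return batches
```

Applied to the 1,049 DOCX files used in this study, such a split would yield one batch of 525 files and one of 524, each of which could then be converted and timed separately.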
During the study, the SCAPE backend was configured with the following servers:
- One “Standard_A1 (1 core, 1.75 GB memory)” Azure web role servicing the SCAPE API requests.
- One “Standard_A3 (4 cores, 7 GB memory)” Azure virtual machine hosting SharePoint with Word Automation Services.
- One “Standard_A1 (1 core, 1.75 GB memory)” Azure worker role hosting the Open XML / ODF Translator.
The test program was run on a “Standard_A2 (2 cores, 3.5 GB memory)” virtual machine created for this study. The test program was not CPU-intensive, so the size of the test machine wasn’t important.
Microsoft Research. “SCAPE API Guide,” 2014. Archived in the Microsoft Research SCAPE TFS repository at $/CML/Projects/Scape/Documentation/ApiService/ScapeApiGuide.docx.