Title | File management and matching of tif, htm and pdf files solution |
Detailed description | Checking tool for OCR MASTER template.xls The solution (above) is an Excel 2003 spreadsheet containing VBA macros that perform the comparison and matching Francine described. The solution is quite specific to Francine's issue because her requirements were specific: it had to work with an existing spreadsheet in a fixed format, and the file comparisons to be done were precisely defined. Excel may seem like an odd choice, but it is a very powerful data-manipulation tool and fits naturally into the NLS toolset, since other related work is performed with Access databases and other Excel macros. The macro takes the existing data in the spreadsheet (representing TIFF filenames) and queries a user-specified folder to check that corresponding .pdf and .htm files exist. It uses the VBA FileSystemObject to query local files and folders. The solution has only been tested on a local disk, but no issues are expected when using it on a Windows network share. Most of the code is mechanical: querying the filesystem, performing simple comparisons, and Excel-manipulation boilerplate. As a result of this work, other practitioners raised similar issues, and similar (but different) solutions were provided for them; see the GitHub repo for more information. This also lays the groundwork for a more general approach that solves this class of issues with a single tool. |
Solution Champion | Andrew Amato |
Corresponding Issue(s) | Disassociation of files and metadata |
Tool/code link | Relevant subfolder of SPRUCE GitHub repository |
Tool Registry Link |
http://wiki.opf-labs.org/display/TR/Resource+Audit+and+Comparison+Tool+%28ReACT%29 |
Evaluation | From the Solution owner: The solution worked well and was developed in just over a morning. The main problems foreseen were that the developer had a newer version of Excel than the practitioner, and that the tool might not work on items stored on a network. The checking tool worked over several attempts on a sample of 204 file pairs, some of which had their names and extensions altered so that they would not all match up. The results were quickly displayed on the NLS spreadsheet that serves as the master list for each batch of pages digitised, so the tool fits in nicely with current workflow and practices. The checking tool is now ready to be used in the NLS for checking OCR outputs, and Francine is confident it can work on networked folders, as she uses similar VB tools for renaming files. Another practitioner saw the checking tool and had Andrew develop a similar one to match original and preserved files. Andrew can adjust the tool to suit different collections, which Francine had not foreseen. The tool has no guaranteed sustainability, but several NLS staff members could help maintain and tweak it. Andrew: In addition, there is more scope for generalisation and abstraction. The two other issues (from ADS and PRONI) were quite similar to Francine's requirements, but their solutions were developed separately for reasons of practicality and time constraints. In general, though, the issues are the same: comparing lists of files. In this case, it is one Excel list and a folder on disk; for the ADS case, it was two folders on disk; and for the PRONI case, it is two (or more) lists in Excel. It is hoped that this could be abstracted into a generic comparison tool, with enough tweakability to meet most of these use cases; these spreadsheets would be a starting point. From Paul and report back session (specifically regarding original solution):
|
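The matching described under Detailed description, and the generic comparison Andrew suggests in the Evaluation, can be illustrated with a minimal sketch. This is not the VBA macro from the repository; it is written in Python purely for illustration. The function names (`stems`, `compare`, `names_in_folder`) are hypothetical, and it assumes that files correspond when their base names match once the extension is dropped and case is ignored.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


def stems(names: Iterable[str]) -> set[str]:
    """Reduce filenames to lower-cased base names without their extension.

    Assumption: 'page_001.tif' and 'PAGE_001.pdf' count as the same item.
    """
    return {Path(name).stem.lower() for name in names}


def compare(expected: Iterable[str], found: Iterable[str]) -> dict[str, list[str]]:
    """Compare two lists of filenames by base name.

    'expected' could be the TIFF names from the master spreadsheet and
    'found' the .pdf (or .htm) names in a folder, but any two lists work.
    """
    want, have = stems(expected), stems(found)
    return {
        "matched": sorted(want & have),      # present in both lists
        "missing": sorted(want - have),      # expected but not found
        "unexpected": sorted(have - want),   # found but not expected
    }


def names_in_folder(folder: str, extension: str) -> list[str]:
    """List the files in a folder that carry the given extension, e.g. '.pdf'."""
    ext = extension.lower()
    return [
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ext
    ]


if __name__ == "__main__":
    # Hypothetical data standing in for the spreadsheet column and a folder listing.
    tiff_names = ["p001.tif", "p002.tif", "p003.tif"]
    pdf_names = ["P001.pdf", "p003.pdf", "p009.pdf"]
    print(compare(tiff_names, pdf_names))
```

Because `compare` only takes two lists of names, the three cases mentioned above reduce to how the lists are obtained: a spreadsheet column against `names_in_folder(...)`, two `names_in_folder(...)` calls for two folders, or two spreadsheet columns.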