File management and matching of tif, htm and pdf files solution

Title File management and matching of tif, htm and pdf files solution
Detailed description Checking tool for OCR MASTER template.xls

The solution (above) is an Excel 2003 spreadsheet containing VBA macros that perform the comparison and matching Francine described. It is quite specific to Francine's issue because her requirements were specific: it had to work with an existing spreadsheet in a fixed format and carry out an exact set of file comparisons.

Excel may seem like an odd choice, but it is a very powerful data-manipulation tool, and it fits very naturally into the NLS toolset, since other related work is performed with Access databases and other Excel macros.

The macro takes the existing data in the spreadsheet (representing TIFF filenames) and queries a user-specified folder to check that each filename has corresponding .pdf and .htm files. It uses the VBA FileSystemObject to query local files and folders. The solution has only been tested on a local disk, but no issues are expected when using it on a Windows network share. Most of the code is mechanical: querying the filesystem, performing simple comparisons, and Excel-manipulation boilerplate.
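The matching logic described above can be sketched in a few lines. This is an illustration in Python, not the VBA macro itself: the function name `check_outputs`, its parameters, and the case-insensitive matching on the filename stem are all assumptions made for the example.

```python
from pathlib import Path


def check_outputs(tiff_names, folder, extensions=(".pdf", ".htm")):
    """Compare a list of TIFF filenames against the files in a folder.

    Hypothetical helper: returns (missing, unexpected), where `missing`
    lists the expected output files that were not found, and `unexpected`
    lists .pdf/.htm files in the folder that match no TIFF name.
    """
    # Assumption: files are matched on the name without its extension,
    # ignoring case (as Windows filesystems usually do).
    expected_stems = {Path(name).stem.lower() for name in tiff_names}

    found = {}        # extension -> set of stems found with that extension
    unexpected = []   # output files with no corresponding TIFF entry
    for entry in Path(folder).iterdir():
        if not entry.is_file():
            continue
        ext = entry.suffix.lower()
        if ext not in extensions:
            continue  # ignore file types that are not being checked
        stem = entry.stem.lower()
        if stem in expected_stems:
            found.setdefault(ext, set()).add(stem)
        else:
            unexpected.append(entry.name)

    missing = sorted(
        stem + ext
        for stem in expected_stems
        for ext in extensions
        if stem not in found.get(ext, set())
    )
    return missing, sorted(unexpected)
```

In the spreadsheet, the list of TIFF names would come from the existing column of data and the folder from the user's selection; here both are simply passed in as arguments.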

Also, as a result of this work, other practitioners raised similar issues, and similar (but separate) solutions were provided for them. See the GitHub repo for more information. This work also lays the groundwork for a more general approach that solves this class of issues with a single tool.
Solution Champion Andrew Amato
Corresponding Issue(s) Disassociation of files and metadata
Tool/code link Relevant subfolder of SPRUCE Github repository
Tool Registry Link http://wiki.opf-labs.org/display/TR/Resource+Audit+and+Comparison+Tool+%28ReACT%29
Evaluation From the Solution owner:
The solution worked well and was developed in just over a morning. The main problems foreseen were that the developer had a newer version of Excel than the practitioner, and that the tool might not work on items stored on a network. The checking tool worked over several attempts on a sample of 204 file pairs, some of which had their names and extensions altered so that they would not all match. The results were quickly displayed in the NLS's spreadsheet, which is the master list for each batch of pages digitised, so the tool fits in nicely with the current workflow and practices.
The checking tool is now ready to be used in the NLS for checking OCR outputs, and Francine is confident it can work on networked folders, as she uses similar VB tools for renaming files.
Another practitioner saw the checking tool and had a similar one developed by Andrew to match original and preserved files. Andrew can adjust it to suit different collections, which Francine had not foreseen. The tool has no guaranteed sustainability, but several NLS staff members could help maintain and tweak it.

Andrew: In addition, there's more scope for generalisation and abstraction. The two other issues (from ADS and PRONI) were quite similar to Francine's requirements, but their solutions were developed separately for reasons of practicality and time constraints. In general, though, the issues are the same: comparing lists of files. In this case, it's one Excel list and a folder on disk; for the ADS case, it was two folders on disk; and for the PRONI case, it is two (or more) lists in Excel. It is hoped that this could be abstracted into a generic comparison tool, with enough tweakability to meet most of these types of use-cases. These spreadsheets would be a starting point.
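One way to picture such a generic tool is to reduce every source, whether a list of names or a folder on disk, to a set of comparable names and then compare the sets. The sketch below is hypothetical and written in Python rather than VBA; the helper names and the choice to compare on the lower-cased filename without its extension are assumptions, not part of any of the three solutions.

```python
from pathlib import Path


def stems_from_names(names):
    """Reduce a list of filenames (e.g. a spreadsheet column) to a set of stems."""
    return {Path(name).stem.lower() for name in names}


def stems_from_folder(folder):
    """Reduce the files directly inside a folder to a set of stems."""
    return {p.stem.lower() for p in Path(folder).iterdir() if p.is_file()}


def compare_sources(sources):
    """For each labelled source, list the stems it lacks that another source has.

    `sources` maps a label (hypothetical, e.g. "master list") to a set of stems.
    """
    all_stems = set().union(*sources.values())
    return {label: sorted(all_stems - stems) for label, stems in sources.items()}
```

With this shape, the NLS case would pass one list and one folder, the ADS case two folders, and the PRONI case two or more lists.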
From Paul and report back session (specifically regarding original solution):
  • Thorough, working solution.
  • Several other practitioners immediately voiced a similar need (see notes above)
  • Very clear reporting of missing/unexpected files
  • Cross-links to unexpected files enable immediate investigation
  • Issue champion: Smiley face! Can be applied to run large batches. Local technical experience can apply any tweaks if necessary. This solves the problem!
From Paul and report back session (specifically regarding adapted solution for ADS):
  • Enables browsing for source and destination directories
  • Then matches files and reports issues
  • Issue owner: Very useful additional check that no files have gone missing as part of a preservation management process.
  • Would be great to develop a slightly more generic version that could be picked up and used in other organisations with adaptation.
  • Double check needed that it works fine on a Windows fileshare
Labels: management, excel, pdf, htm, tif, matching, ocr, tool, tiff, visual, basic, macro, spruce, spruce_glasgow, solution, structural_relationships