India Papers Collection

Title
India Papers Collection
Description The Medical History of British India website http://digital.nls.uk/indiapapers/ allows full browsing and searching of 330 medical volumes from the National Library of Scotland's India Papers collection.
Three phases are online, and a further two phases (20,000 pages each) are undergoing OCR and export to XML before public release.

One phase of the project is around 20,000-40,000 pages, and this is split into batches of 5,000-7,000 pages. Each single page is listed in an Access database, and lists are then produced in Excel for the microfilm and digitisation vendor.
From the microfilm, tif images are produced, and from each tif image a pdf and an htm file are generated by Optical Character Recognition (OCR).
The original metadata and the numbers of each Access page and tif file are held on a master Excel sheet.

I wonder if there is a tool to audit or inventory the pdf and htm files and check that they match the tif numbers (each page has the same number but a different file extension) before they are handed over to the NLS Digital Library team. Ideally the output would be in Excel format so that it could be pasted into or next to the master Excel sheet.

Current tools which I use are: Access to Excel export via Visual Basic, the Renamer and TreeSize Professional.

It is vital that all the derivatives match the master tif number, as whole books are shown on the website from cover to cover.
It would be time-saving and useful if a computer tool could identify any mismatches or missing files.
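
As a rough illustration of the kind of audit described above, here is a minimal Python sketch. It is not an existing tool: the function names (`stems_by_ext`, `audit`, `write_csv`) are made up for this example, and it assumes that the tif, pdf and htm files for a batch all sit somewhere under one folder and that a page's files share the same name apart from the extension. It writes a CSV file, which Excel opens directly. Rows are sorted as text, so the numbers need leading zeros to come out in numeric order.

```python
import csv
from pathlib import Path

# Assumption: the derivatives are .pdf and .htm files, as in the description.
DERIVATIVE_EXTS = (".pdf", ".htm")


def stems_by_ext(root, ext):
    """Collect file names (without extension) for one extension under root."""
    return {
        p.stem.lower()
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ext
    }


def audit(root, master_ext=".tif"):
    """Return one row per page number, flagging any missing tif, pdf or htm."""
    masters = stems_by_ext(root, master_ext)
    derived = {ext: stems_by_ext(root, ext) for ext in DERIVATIVE_EXTS}
    all_stems = masters.union(*derived.values())
    rows = []
    for stem in sorted(all_stems):
        row = {"number": stem, "tif": "yes" if stem in masters else "MISSING"}
        for ext in DERIVATIVE_EXTS:
            row[ext.lstrip(".")] = "yes" if stem in derived[ext] else "MISSING"
        rows.append(row)
    return rows


def write_csv(rows, out_path):
    """Write the audit rows to a CSV file that Excel can open."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["number", "tif", "pdf", "htm"])
        writer.writeheader()
        writer.writerows(rows)
```

It could be called as `write_csv(audit(r"E:\batch01"), "audit.csv")`, where both paths are placeholders. Filtering the resulting sheet for "MISSING" would then show the pages whose derivatives do not match a master tif.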
Licensing Many images are already available online for free. The sample files have restrictions as they have not been released publicly yet.
Owner The images are owned by the National Library of Scotland
Dataset Location On external USB hard drive brought to mashup
Collection Champion Francine Millard
Issues brainstorm
  • Checking file names match in Excel - a tool to do this
  • Inventory/audit of pdf and htm files from Windows Explorer to Excel spreadsheet
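
One possible way to approach the two issues above is sketched below in Python. It is only an illustration under stated assumptions: the master Excel sheet is assumed to have been saved as a CSV file, the column name `tif_number` is hypothetical, and the function names are invented for this example. It takes an inventory of the files on disk and compares it with the numbers in the master list.

```python
import csv
from pathlib import Path


def read_master_numbers(csv_path, column="tif_number"):
    """Read page numbers from a CSV export of the master sheet.

    The column name is a hypothetical placeholder.
    """
    numbers = []
    # utf-8-sig copes with the byte-order mark Excel may add to CSV exports.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip()
            if value:
                numbers.append(value)
    return numbers


def inventory(root, exts=(".tif", ".pdf", ".htm")):
    """Map each extension to the set of file names (without extension) found."""
    found = {ext: set() for ext in exts}
    for p in Path(root).rglob("*"):
        if p.is_file() and p.suffix.lower() in found:
            found[p.suffix.lower()].add(p.stem)
    return found


def compare_with_master(numbers, found):
    """Return (missing, unexpected) file numbers for each extension."""
    expected = set(numbers)
    missing = {ext: sorted(expected - stems) for ext, stems in found.items()}
    unexpected = {ext: sorted(stems - expected) for ext, stems in found.items()}
    return missing, unexpected
```

Here `missing` lists numbers that are in the master sheet but have no file on disk, and `unexpected` lists files on disk whose number is not in the master sheet.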
Link to issues
Disassociation of files and metadata  
Labels: management, excel, pdf, tif, matching, tiff, html, spruce, spruce_glasgow, dataset, image