
| *Title* \\ | India Papers Collection \\ |
| *Description* | The Medical History of British India website [http://digital.nls.uk/indiapapers/] allows full browsing and searching of 330 medical volumes from the National Library of Scotland's India Papers collection. \\
Three phases are online, and a further two phases (20,000 pages each) are undergoing OCR and export to XML for public release. \\
\\
One phase of the project covers around 20,000-40,000 pages and is split into batches of 5,000-7,000 pages. Every page is listed in an Access database, and lists are then produced in Excel for the microfilm and digitisation vendor. \\
tif images are produced from the microfilm, and a pdf and an htm file are produced from each tif image by Optical Character Recognition. \\
The original metadata and the numbers of each Access page and tif file are held on a master Excel sheet. \\
\\
I wonder if there is a tool to audit/inventory the pdf and htm files and check that they match the tif numbers (each page has the same number but a different file extension) before they are handed over to the NLS Digital Library team. Ideally the output would be in Excel format so it could be pasted into or next to the master Excel sheet. \\
\\
The tools I currently use are: Access to Excel export via Visual Basic, the Renamer and TreeSize Professional. \\ \\
It is vital that all the derivatives match the master tif number as whole books are shown on the website from cover to cover. \\
It would save time if a tool could identify any mismatches or missing files automatically. \\ |
| *Licensing* | Many images are already available online for free. The sample files have restrictions as they have not yet been publicly released. \\ |
| *Owner* | Owner of images is National Library of Scotland \\ |
| *Dataset Location* | On external USB hard drive brought to mashup \\ |
| *Collection Champion* | ___[~francinemillard]_ |
| *Issues brainstorm* | * Checking file names match in Excel - a tool to do this
* Inventory/audit of pdf and htm files from Windows Explorer to Excel spreadsheet |
| *Link to issues* \\ | [SPR:Disassociation of files and metadata] | |
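
The audit asked for under Issues brainstorm could be sketched roughly as follows. This is a minimal Python sketch, not an existing tool. It assumes that every page should exist as one tif, one pdf and one htm file sharing the same number as the file name, and that all files for a batch sit under one folder. The function names, column names and status wording are made up for illustration.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Assumption: every page should exist once as a master tif plus one pdf
# and one htm derivative, all sharing the same number as the file name.
EXPECTED_EXTS = (".tif", ".pdf", ".htm")


def audit_folder(root):
    """Return one dict per page number found under root, sorted by number."""
    found = defaultdict(set)
    for path in Path(root).rglob("*"):
        ext = path.suffix.lower()
        if path.is_file() and ext in EXPECTED_EXTS:
            # Group by file name without extension, e.g. "0001" for 0001.tif.
            # Files with the same number in different subfolders are merged.
            found[path.stem].add(ext)

    rows = []
    for page in sorted(found):
        missing = [e.lstrip(".") for e in EXPECTED_EXTS if e not in found[page]]
        row = {"page": page}
        for ext in EXPECTED_EXTS:
            row[ext.lstrip(".")] = "yes" if ext in found[page] else "no"
        row["status"] = ("MISSING: " + ", ".join(missing)) if missing else "OK"
        rows.append(row)
    return rows


def write_report(rows, out_csv):
    """Write the audit rows to a CSV file that Excel can open."""
    fieldnames = ["page", "tif", "pdf", "htm", "status"]
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_report(audit_folder(folder), "audit.csv")`, where `folder` is the path to a batch on the hard drive, would give one row per page number with a yes/no column for each file type. The result can be sorted or filtered on the status column and pasted next to the master Excel sheet. Excel may drop leading zeros from page numbers when a CSV is opened directly; importing the page column as text avoids this.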