||Identifying and Sorting Content|
|Detailed description|| A detailed description of the Issue. The Issue MUST focus on the business or preservation driven challenge, and should not assume or describe a particular solution.
Gathering information about content, including file identification and metadata, to inform:
-sorting and processing unappraised material
-preservation planning.
2. Image files are the second largest group in the data set: approx 20,000.
What information can we gather to process further:
What metadata can we extract?
This issue has questions in common with Jodie's dataset.
|Issue champion|| Who owns the issue? Include an email address if possible
Ifor ap Dafydd, [email protected]
|Other interested parties|| Jodie Double
|Possible Solution approaches|| Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list
‘File’ (Unix command) to identify content of disk (path, title, author, date of creation, date of last modification)
Partitioning and sorting in the spreadsheet
exiftool for collecting more metadata
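As a rough illustration of the first two approaches, the sketch below walks a directory tree and writes basic filesystem metadata to a CSV file that can be partitioned and sorted in a spreadsheet. It is not the script used at the event: it relies only on the Python standard library rather than `file` or exiftool, and the function names `inventory` and `write_inventory` and the column names are hypothetical. It records only what the filesystem itself reports (path, extension, size, last modification), so embedded metadata such as title or author would still need a dedicated tool.

```python
import csv
import os
from datetime import datetime, timezone


def inventory(root):
    """Yield one dict of basic filesystem metadata per file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            yield {
                "path": path,
                # Lower-cased extension, so ".DOC" and ".doc" sort together.
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": info.st_size,
                # Last modification time as an ISO 8601 string in UTC.
                "modified": datetime.fromtimestamp(
                    info.st_mtime, tz=timezone.utc
                ).isoformat(),
            }


def write_inventory(root, csv_path):
    """Write the inventory of root to csv_path; return the number of rows."""
    fields = ["path", "extension", "size_bytes", "modified"]
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for row in inventory(root):
            writer.writerow(row)
            count += 1
    return count
```

The CSV should be written outside the directory being scanned, otherwise the output file may appear in its own inventory.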
|Context|| Details of the institutional context to the Issue. (May be expanded at a later date)
Institution is currently working towards specifying requirements for an ingest workflow for born digital materials.
These unappraised, rescued materials are an example of a legacy collection which, while safeguarded in the Digital Archive, has not been processed, described or made available for access.
|Lessons Learned|| Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Day 1 – Dave and Michael; Jodie, Ifor
What have we done: identify content of disk, formats and other metadata
What did we use: ran ‘File’ (Unix command) to identify content of disk (path, title, author, date of creation, date of last modification)
What did we find:
-“Top ten” file types form vast majority of material;
long tail of other formats e.g. 38,000 Word docs; 21 databases
Data for ‘top ten’ (9?) placed in Excel spreadsheet.
Why is File the best tool/way of doing this?
Are there other options?
Tool for assisting in sorting the data?
Extracting further metadata?
Selecting or preparing for ingest?
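One way to make the "top ten versus long tail" observation from Day 1 repeatable is to tally the format labels directly instead of counting in the spreadsheet. The sketch below assumes the identification results have already been reduced to one format label per file; the function name `top_and_tail` and the keys of the returned dictionary are hypothetical.

```python
from collections import Counter


def top_and_tail(format_labels, n=10):
    """Summarise how many files fall in the n most common formats."""
    counts = Counter(format_labels)
    top = counts.most_common(n)
    top_total = sum(count for _label, count in top)
    total = sum(counts.values())
    return {
        "top": top,                               # [(label, count), ...]
        "tail_formats": len(counts) - len(top),   # formats outside the top n
        "tail_files": total - top_total,          # files in those formats
        "top_share": top_total / total if total else 0.0,
    }
```

A high `top_share` alongside a large `tail_formats` value would match the pattern described above: a few formats covering most of the material, with many rare formats left over.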
Day 2: What we looked at: 5736 directories
1. Excel spreadsheet for 'top ten' formats - contains valuable metadata but unwieldy to look at.
4th most numerous file type on the list is Unknown.
1000 unknown results -- these could be 1000 of same thing?
What is the next factor or term to sort the data?
Assume that we want to keep all the data: therefore perhaps consider searching for best ways to weed out material?
-Dataset includes Dbase index file, and other file format types to be shared with Andrew for DROID registry identification
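The Day 2 question of whether the 1000 "Unknown" results might be 1000 of the same thing could be explored by grouping those files on their leading bytes, since files of one format often share a common header. This is only a sketch under that assumption, not a substitute for signature-based identification: the function name `group_by_signature` and the default of 8 bytes are hypothetical choices, and some formats have no fixed header at all.

```python
from collections import defaultdict


def group_by_signature(paths, length=8):
    """Group file paths by the hex of their first `length` bytes."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            signature = handle.read(length)
        # Hex keys are easier to read and compare than raw bytes.
        groups[signature.hex()].append(path)
    return dict(groups)
```

A handful of large groups would suggest that the unidentified files belong to a few formats, and one sample from each group could then be passed on for closer inspection.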
|Datasets|| Sgrin Archive
Ida Roper Herbarium archive
Leeds image duplicates and versions
|Solutions|| Reference to the appropriate Solution page(s), by hyperlink
Open Planets Foundation - File Scanner