
| *Title* \\ | Identifying Content and Sorting |
| *Detailed description* | _A detailed description of the Issue. The Issue_ *{_}MUST{_}* _focus on the business or preservation driven challenge, and should not assume or describe a particular solution._ \\
Gathering information about content, including file identification and metadata, to inform: \\
\-sorting and processing unappraised material \\
\-preservation planning. \\
\\
2. Image files are the second largest group in the data set: approx 20,000. \\
What information can we gather to process further:  \\
Sorting?  \\
Deduping? \\
What metadata can we extract? \\
This issue has questions in common with Jodie's dataset. \\ |
| *Issue champion* | _Who owns the issue? Include an email address if possible_ \\
Ifor ap Dafydd, [[email protected]|mailto:[email protected]] |
| *Other interested parties* \\ | [~jodie.double]\\ |
| *Possible Solution approaches* | _Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list_ \\
‘File’ (Unix command script) to identify content of disk (path, title, author, date of creation, date of last modification) \\
Partitioning and sorting in the spreadsheet \\
Indexing \\
Faceted search \\
exiftool for collecting more metadata \\ |
| *Context* | _Details of the institutional context to the Issue. (May be expanded at a later date)_ \\
The institution is currently working towards specifying requirements for an ingest workflow for born digital materials. \\
These unappraised, rescued materials are an example of a legacy collection which, while safeguarded in the Digital Archive, has not been processed, described or made available for access. \\ |
| *Lessons Learned* | _Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_ \\
Day 1 -- Dave and Michael; Jodie, Ifor \\
What have we done: identify content of disk, formats and other metadata \\
What did we use: \-Ran ‘File’ (Unix command script) to identify content of disk (path, title, author, date of creation, date of last modification) \\
What did we find: \\
\-“Top ten” file types form the vast majority of material; \\
long tail of other formats, e.g. 38,000 Word docs; 21 databases \\
\\
\\
Data for ‘top ten’ (9?) placed in Excel spreadsheet. \\
Questions: \\
Why is File the best tool/way of doing this? \\
Are there other options? \\
Next challenges: \\
Tool for assisting in sorting the data? \\
Extracting further metadata? \\
Selecting or preparing for ingest?  \\
\\
Day 2: What we looked at: 5736 directories \\
1. Excel spreadsheet for 'top ten' formats - contains valuable metadata but unwieldy to look at. \\
The 4th most numerous file type on the list is Unknown. \\
1000 unknown results: could these be 1000 of the same thing? \\
Questions: \\
What is the next factor or term to sort the data?   \\
Assume that we want to keep all the data: therefore perhaps consider searching for best ways to weed out material? \\
\\
\-Dataset includes a dBASE index file and other file format types, to be shared with Andrew for DROID registry identification \\ |
| *Datasets* | [REQ:Sgrin Archive]\\
[REQ:Ida Roper Herbarium archive]\\
[REQ:Leeds image duplicates and versions]\\ |
| *Solutions* | _Reference to the appropriate Solution page(s), by hyperlink_ \\
[Open Planets Foundation - File Scanner|REQ:Open Planets Foundation - File Scanner] |
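
The sorting and deduping questions raised above could be explored with a small script before moving to a dedicated tool. The following is only a minimal sketch, not the approach used on the day: it assumes Python is available, it tallies files by extension as a rough stand-in for real format identification (‘File’ or DROID would be more reliable), and it treats two files as duplicates only when their SHA-256 checksums match. The names `inventory` and `sha256_of` are hypothetical helpers introduced for illustration.

```python
import hashlib
import os
from collections import Counter, defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Walk root and return (extension counts, groups of byte-identical files).

    Assumption: the extension is only a hint at the format; it is used here
    because it needs no external tool.
    """
    ext_counts = Counter()
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            ext_counts[ext] += 1
            try:
                by_digest[sha256_of(path)].append(path)
            except OSError:
                # Unreadable file: counted by extension, skipped for deduping.
                continue
    # Keep only checksums shared by more than one path.
    duplicates = [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
    return ext_counts, duplicates
```

Calling `inventory` on the root of the disk returns a `Counter`, so `most_common(10)` gives a "top ten" list by extension, and each entry in the second result is a list of paths with identical content. Checksum matching finds exact copies only; near-duplicate images or edited versions would need a different technique.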