Identifying content and Sorting

Title
Identifying content and Sorting
Detailed description A detailed description of the Issue. The Issue MUST focus on the business or preservation driven challenge, and should not assume or describe a particular solution.
Gathering information about content, including file identification and metadata, to inform:
-sorting and processing unappraised material
-preservation planning.

2. Image files are the second largest group in the data set: approx 20,000.
What information can we gather to process further: 
Sorting? 
Deduping?
What metadata can we extract?
This issue has questions in common with Jodie's dataset.
Issue champion Who owns the issue? Include an email address if possible
Ifor ap Dafydd, ifor.ap.dafydd@llgc.org.uk
Other interested parties
Jodie Double
Possible Solution approaches Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list
‘File’ (Unix command script) to identify content of disk (path, title, author, date of creation, date of last modification)
Partitioning and sorting in the spreadsheet
Indexing
Faceted search
exiftool for collecting more metadata
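As a rough illustration of the first two approaches, the sketch below walks a directory tree, asks the Unix `file` command for a MIME type per file, and writes one row per file to a CSV that can be opened in a spreadsheet. The function names, the CSV columns and the choice of `file --brief --mime-type` are assumptions made for illustration; this is not the script that was run at the event. File-system timestamps are used for the modification date, and fields such as title and author would need a separate metadata tool.

```python
import csv
import os
import subprocess
from datetime import datetime, timezone


def identify_with_file(path):
    """Ask the Unix `file` command for a MIME type (assumes `file` is installed)."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or "unknown"


def build_inventory(root, identify=identify_with_file):
    """Return one dict per file under root: path, type, size, last modified."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append({
                "path": path,
                "type": identify(path),
                "size_bytes": info.st_size,
                "modified_utc": modified.isoformat(),
            })
    return rows


def write_inventory(rows, csv_path):
    """Write the inventory rows to a CSV file with a header line."""
    fields = ["path", "type", "size_bytes", "modified_utc"]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

The `identify` argument is kept pluggable so that a different identification tool could be swapped in without changing the directory walk.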
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Institution is currently working towards specifying requirements for an ingest workflow for born digital materials.
These unappraised, rescued materials are an example of a legacy collection which, while safeguarded in the Digital Archive, has not been processed, described or made available for access.
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Day 1 – Dave and Michael; Jodie, Ifor
What have we done: identify content of disk, formats and other metadata
What did we use: ran ‘File’ (Unix command script) to identify content of disk (path, title, author, date of creation, date of last modification)
What did we find:
-“Top ten” file types form vast majority of material;
long tail of other formats e.g. 38,000 Word docs; 21 databases
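A "top ten plus long tail" summary like the one described here can be produced from any list of identified types. The sketch below is a minimal example, assuming the identification results are already available as a list of type labels; the function name `summarise_types` and its return shape are hypothetical.

```python
from collections import Counter


def summarise_types(types, top_n=10):
    """Split type counts into the top_n most common and the remaining tail.

    Returns (top, tail, share), where share is the fraction of all files
    covered by the top_n types.
    """
    counts = Counter(types)
    ranked = counts.most_common()  # sorted by count, largest first
    top = ranked[:top_n]
    tail = ranked[top_n:]
    total = sum(counts.values())
    share = sum(n for _label, n in top) / total if total else 0.0
    return top, tail, share


if __name__ == "__main__":
    # Made-up labels for illustration only, not the event's data.
    sample = ["word"] * 5 + ["image"] * 3 + ["unknown"] * 2 + ["database"]
    top, tail, share = summarise_types(sample, top_n=3)
    print(top)
    print(tail)
    print(round(share, 2))
```

Reporting the share covered by the top types gives a quick check on the claim that a handful of formats account for most of the material.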


Data for ‘top ten’ (9?) placed in Excel spreadsheet.
Questions:
Why is File the best tool/way of doing this?
Are there other options?
Next challenges:
Tool for assisting in sorting the data?
Extracting further metadata?
Selecting or preparing for ingest? 

Day 2: What we looked at: 5736 directories
1. Excel spreadsheet for 'top ten' formats - contains valuable metadata but unwieldy to look at.
4th most numerous file type on the list is Unknown.
1000 unknown results: could these be 1000 of the same thing?
Questions:
What is the next factor or term to sort the data?  
Assume that we want to keep all the data: therefore perhaps consider searching for best ways to weed out material?
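One way to test whether a large group of files is "1000 of the same thing", and to find candidates for weeding, is to group files by checksum. The sketch below is an assumption about how that could be done, not something tried at the event: it compares file sizes first, then SHA-256 digests, so it only finds byte-for-byte duplicates. Edited versions or resized copies of an image would not be caught. The names `sha256_of` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Files of different sizes cannot be identical, so group by size first
    # and only hash files that share a size with at least one other file.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for path in same_size:
            by_hash[sha256_of(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Each returned group lists paths with identical content, which could then be reviewed before deciding what to keep.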

-Dataset includes a dBase index file and other file format types, to be shared with Andrew for DROID registry identification
Datasets Sgrin Archive
Ida Roper Herbarium archive
Leeds image duplicates and versions
Solutions Reference to the appropriate Solution page(s), by hyperlink
Open Planets Foundation - File Scanner
Labels:
issue
york_hackathon
obsolescence
unknown_characteristics
appraisal_assessment