|| Dealing with difficult identification cases
|Detailed description||Identification Requirements, Format Languages, Requirements and Difficult Cases. Mutants and wild types. Strains. See below for specific examples.|
| Scalability Challenge
||The solution must be able to identify and describe the large number of formats in our collections, and their complexity.|
|Issue champion||Maureen Pennock (BL)|
| Other interested parties
||Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg: Schlarb Sven (ONB)|
|Possible Solution approaches|| We need to collect concrete examples of difficult format cases, ones that we need to identify for preservation purposes, but which current tools do not adequately describe. Hopefully we can generate a richer language for describing format, and ensure that it covers the cases we need.
We can also compare the results from different identification systems to expose where the format is poorly understood or poorly described, and to drive improvements in format coverage.
|Context||We have a lot of stuff we can't identify, and a lot of stuff that is only identified at a coarse level.|
|Lessons Learned|| Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
|Training Needs|| Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
|Datasets|| All datasets! :-)
|Solutions|| SO3 Comparing identification tools
|Objectives||Which SCAPE objectives do this issue and a future solution relate to? E.g. scalability, robustness, reliability, coverage, preciseness, automation|
|Success criteria||Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?|
|Automatic measures|| What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important?
If possible specify very specific measures and your goal - e.g.
* process 50 documents per second
* handle 80 GB files without crashing
* identify 99.5% of the content correctly
|Manual assessment|| Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
* Solution installable with basic linux system administration skills
* User interface understandable by non developer curators
|Actual evaluations||Links to actual evaluations of this Issue/Scenario|
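The comparison of identification systems proposed under Possible Solution approaches could start from something like the sketch below. It assumes that each tool's output has already been reduced to a MIME type string per file; the function names and the `results` layout are illustrative assumptions, not part of any existing SCAPE tool. Note that it deliberately drops MIME parameters such as `version=1.4`, so it only exposes coarse disagreements, not version-level ones.

```python
def normalise(mime):
    """Lower-case a MIME type and drop any parameters (e.g. '; version=1.4')."""
    return mime.split(";", 1)[0].strip().lower()


def find_disagreements(results):
    """Return the files for which the sources report different MIME types.

    `results` maps a file path to a dict of {source name: MIME type or None}.
    This layout is an assumption made for the sketch; a source can be an
    identification tool or external evidence such as a server header.
    """
    disagreements = {}
    for path, by_source in results.items():
        # Ignore sources that produced no answer for this file.
        reported = {src: normalise(mime) for src, mime in by_source.items() if mime}
        if len(set(reported.values())) > 1:
            disagreements[path] = reported
    return disagreements
```

Files that end up in the returned dict are candidates for the list of difficult cases that follows.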
- Where identification tools disagree.
- Where sources (e.g. server MIME type) disagree with identification tools.
- What evidence the tools use: extension, MIME type, magic, partial parse, full parse, quirks mode, etc.
- Containers, optional and required. (not sure what I meant by that!)
- Multi-component formats (MDS/MDF, BIN/CUE, ShapeFile, EndNote, etc.)
- Later versions have Levels, according to KEEP TOTEM (here, sign-up required) - do we need more PUIDs?
- Distinguish PDF/A-1a, PDF/A-1b, PDF/A-2, etc.?
- PDF that does not declare PDF/A but is version 1.4 and uses no features that PDF/A disallows (i.e. is conformant apart from the flag). Do we need to detect this to avoid a needless migration?
- PDF that declares 1.7 but only uses 1.4 features.
- TIFF 6 defines several formats: TIFF 6 baseline, TIFF 5(?) extensions, TIFF 6 extensions.
- Is TIFF 3 equivalent to TIFF 6 baseline? What about 4, 5? Do we need to distinguish all these cases?
- Text formats
- XML, root schema/namespace?
- Plain text, encodings, codepages, line endings, mixtures.
- Some tools for source code analysis can help here.
- e.g. GitHub's linguist uses simple methods to spot approximate file types (see here).
- The best tool I've found is ohcount which is written in C, but has a Ragel core which could be compiled to Java etc. if you don't mind re-writing the C wrapping. It can count lines of CSS in HTML etc. and was used successfully here: Use ohcount to detect source code text files
- But, do we need to extend our language for format identification to say this is mostly HTML with six lines of PHP?
- HTML variants, e.g. quirks mode.
- HTML5 and HTML with no 'versions':
- JP2 and JPX
- The JP2 MIME type registration declares the magic for image/jpeg2000, which for some reason differs from that in the publicly available draft spec. fcd15444-1.pdf (the bytes 1a1a became 2020).
- However, the JPX and perhaps other formats use the same header and so how to distinguish? The publicly available draft spec. fcd-15444-2.pdf for JPX talks about a file type box, but does not make it clear.
- It seems they are using the File Type Box from the ISO Base Media File Format. This has the notion of a 'major brand' (format), 'minor version' and 'compatible brands'. Which spec. defines it?
- Thus, if a JP2 has a ftyp box that declares jp2 or jpx as the brand, then we can spot that.
- But, the JPX spec. allows JP2 fallback, as far as I can tell. If this is indicated by, e.g. a jpx major brand and a jp2 compatible brand, do we need to capture that?
- The ftyp box is used in other important formats, so we may need to capture its meaning.
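
To make the JP2/JPX notes above concrete, here is a minimal sketch of reading the ftyp box and reporting both the major brand and the JP2 fallback. It assumes the file starts with the 12-byte signature box using the registered magic (the `2020` variant) followed directly by the ftyp box, and it does not handle extended (64-bit) box lengths. The function names and the returned labels are made up for illustration.

```python
import struct

# 12-byte JPEG 2000 signature box: length 12, type 'jP  ', content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def read_ftyp(data):
    """Return (major_brand, minor_version, compatible_brands), or None.

    Assumes the ftyp box immediately follows the signature box and uses a
    plain 32-bit length (no extended-length or to-end-of-file boxes).
    """
    if not data.startswith(JP2_SIGNATURE):
        return None
    offset = len(JP2_SIGNATURE)
    if len(data) < offset + 16:
        return None
    length, box_type = struct.unpack(">I4s", data[offset:offset + 8])
    if box_type != b"ftyp" or length < 16 or offset + length > len(data):
        return None
    brand, minor = struct.unpack(">4sI", data[offset + 8:offset + 16])
    rest = data[offset + 16:offset + length]
    # The remainder of the box is a list of 4-byte compatible brands.
    compatible = [rest[i:i + 4] for i in range(0, len(rest) - 3, 4)]
    return brand, minor, compatible


def describe(data):
    """Map the ftyp contents to a coarse label (labels are illustrative)."""
    parsed = read_ftyp(data)
    if parsed is None:
        return "unknown"
    brand, _minor, compatible = parsed
    if brand == b"jpx ":
        return "jpx with jp2 fallback" if b"jp2 " in compatible else "jpx"
    if brand == b"jp2 ":
        return "jp2"
    return "other brand: " + brand.decode("ascii", "replace").strip()
```

Whether a label like "jpx with jp2 fallback" belongs in the identification result, or whether the brand list should be reported as-is, is exactly the open question raised above.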