This proposal aims to aid interoperability between the web's current MIME-type architecture and the more finely-grained identification schemes used within the digital preservation community. This would allow preservation concepts to be embedded more easily in existing tools, and provide a route by which more sophisticated identifiers might be more broadly adopted in the future.
The seeds of this proposal are already sown within the original MIME type proposal, RFC 2045. For example, from section 4:
[...] some formats (such as application/postscript) have version numbering conventions that are internal to the media format. Where such conventions exist, MIME does nothing to supersede them. Where no such conventions exist, a MIME media type might use a "version" parameter in the content-type field if necessary.
Furthermore, in section 4.5.1 of the companion RFC 2046:
The "octet-stream" subtype is used to indicate that a body contains arbitrary binary data. The set of currently defined parameters is:
(1) TYPE – the general type or category of binary data. This is intended as information for the human recipient rather than for any automatic processing.
Finally, the standard allows for extension via shared but unofficial and non-standardised MIME types, prefixed with 'x-' (see section 5.1).
This proposal builds on these three areas and extends them.
The proposal has four parts. The first two allow existing types to be described in more detail. The second two deal with describing new types.
Define version parameters for known content types
The proposal is to make the version parameter standard for all content types, and to prescribe its values for known types. For example, for PDF we could define version strings and link them to the PRONOM IDs.
- application/pdf; version="1.0"
- sameAs: PUID:fmt/XXX
- application/pdf; version="1.1"
- application/pdf; version="1.2"
- application/pdf; version="1.3"
- application/pdf; version="1.4"
- application/pdf; version="1.5"
- application/pdf; version="1.6"
- application/pdf; version="1.7"
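As a rough illustration of how a tool might consume such a versioned type, the sketch below splits a Content-Type value into its media type and version parameter using Python's standard `email.message.Message`, then looks the pair up in a table. The table name `PUID_BY_VERSIONED_TYPE` and both function names are hypothetical, and the `fmt/XXX` placeholder is carried over from the list above rather than replaced with a real PUID.

```python
from email.message import Message

# Hypothetical lookup table. The real PUIDs are not given in the text,
# so the "fmt/XXX" placeholder is kept rather than guessed.
PUID_BY_VERSIONED_TYPE = {
    ("application/pdf", "1.0"): "PUID:fmt/XXX",
}


def split_versioned_type(content_type):
    """Return (media_type, version); version is None if the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_param("version")


def lookup_puid(content_type):
    """Map a versioned media type to a PUID, or None if no mapping is known."""
    return PUID_BY_VERSIONED_TYPE.get(split_versioned_type(content_type))


print(split_versioned_type('application/pdf; version="1.4"'))
print(lookup_puid('application/pdf; version="1.0"'))
```

Keying the table on the (type, version) pair means a tool that ignores the version parameter still sees a perfectly ordinary `application/pdf`.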
Define a format-uri parameter for all content types
Add a new format-uri parameter, valid for all content types, that declares a linked-data endpoint identifying the format.
- application/octet-stream; format-uri="http://purl.org/..."
Alternatively, a format-id or format-urn parameter could do the same thing, but using a controlled identifier scheme instead of a URI scheme, e.g.
- format-id = "PUID:fmt/44"
Note that AND relationships are captured like this:
- format-id = "PUID:fmt/44 PUID:fmt/45"
This means the item is valid as both formats, so it can be parsed as either fmt/44 or fmt/45. OR relationships are captured like this:
- format-id = "PUID:fmt/45"; format-id = "PUID:fmt/46"
This means the item could be either of these two, but it is not certain which. I'm not sure this case makes any sense, and I'm not sure repeated parameters are allowed.
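To make the two conventions concrete, here is a minimal sketch of how a reader might interpret them, assuming the format-id parameter is carried on a Content-Type value. It uses `get_params()` from Python's standard `email.message.Message`, which returns every parameter it finds, including repeats. The function name `parse_format_ids` is hypothetical, and the sketch takes no position on whether repeated parameters are actually permitted; it simply reads what is there.

```python
from email.message import Message


def parse_format_ids(content_type):
    """Return a list of alternatives (the OR case).

    Each alternative is a list of identifiers that all apply at once
    (the AND case: space-separated inside one parameter value).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    alternatives = []
    for name, value in msg.get_params():
        if name.lower() == "format-id":
            alternatives.append(value.split())
    return alternatives


# AND: one parameter holding two identifiers.
print(parse_format_ids(
    'application/octet-stream; format-id="PUID:fmt/44 PUID:fmt/45"'))

# OR: the parameter repeated, one identifier each.
print(parse_format_ids(
    'application/octet-stream; format-id="PUID:fmt/45"; format-id="PUID:fmt/46"'))
```

The nested-list result keeps the two relationships distinct: the outer list holds the alternatives, and each inner list holds identifiers that hold simultaneously.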
Formalise the octet-stream 'type' parameter
Take the type parameter (section 4.5.1) and begin to standardise its values. This may not be required if format-uri is in place.
- application/octet-stream; type=""
Using prefixed sub-types when standardisation cannot work
To quote the RFC:
Private values (starting with "X-") may be defined bilaterally between two cooperating agents without outside registration or standardization. Such values cannot be registered or standardized.
The standard also allows x- vendor extensions, but these are only sensible for types that are community-agreed but deliberately transient, or for orphan works. In general, when standardisation is either too difficult or inappropriate, the digital preservation community could agree on these.
Impedance mismatch between PUIDs and MIME Types
Although it may appear less sophisticated, the MIME type scheme's parameter framework makes it somewhat more expressive than PUIDs.
A charset parameter is used to indicate the character set of the file for text subtypes. The octet-stream subtype of type application is used to indicate that a body contains arbitrary binary data. One of the optional parameters for this subtype is type, which is the general type or category of binary data. This is intended as information for the human recipient rather than for any automatic processing. A codecs parameter is used for audio and video media types to indicate the coder-decoder for encoding analog signals to digital and decoding digital to analog signals [RFC4281, RFC5334].
For example, the "charset" parameter is applicable to any subtype of "text", ...
i.e. you can combine any text encoding with any text type. Using PUIDs means that two separate identifiers are required, and there is no scheme for tying them together. Furthermore, in general, one can imagine schemes where centralised minting is not necessary for every combination.
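The combinatorial point can be sketched in a few lines. The helper `content_type` and the two sample lists are illustrative assumptions, not part of any registry: three text types and three charsets, each registered once, compose into nine distinct identifiers without anything further being minted.

```python
from itertools import product


def content_type(media_type, **params):
    """Compose a Content-Type value from a media type and its parameters."""
    parts = [media_type] + [f'{name}="{value}"' for name, value in params.items()]
    return "; ".join(parts)


# Illustrative samples only: each term would be registered once.
text_types = ["text/plain", "text/html", "text/csv"]
charsets = ["us-ascii", "utf-8", "iso-8859-1"]

# Every pairing is expressible without minting a new identifier for it.
combined = [content_type(t, charset=c) for t, c in product(text_types, charsets)]

print(len(combined))  # 9
for value in combined:
    print(value)
```

A scheme that mints one opaque identifier per format would need a separate entry for each of those nine pairings, or a second identifier plus some way of binding the two together.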