Title |
Digital objects archive contains unidentified content |
Detailed description | From an archiving point of view, if there is no detailed information about the exact content of an archive, no preservation planning or preservation actions can be undertaken. For example, if old proprietary Microsoft file formats, as implemented and used in the Microsoft Office system, are part of the archive, it is necessary to know exactly which format the objects are in, in order to decide which preservation tools can best preserve them. From the user’s point of view, the challenge is to provide an authentic experience when giving access to archived web content. As the technical environment (operating systems, renderers, browsers, plugins, etc.) changes continuously, the current user experience might differ significantly from the original or even entail information loss (e.g. footnotes that are not rendered in the new environment). |
Scalability Challenge |
EXL: It is not clear whether a scalability problem exists and whether the provided data sets reflect it.
Issue champion | Markus Raditsch
Other interested parties |
SB: <comment_missing>
KB: Reliable identification is a prerequisite for any preservation strategy.
ONB: Definitely an issue with high priority.
BL: <comment_missing>
IM: Metadata extraction / characterisation
Possible Solution approaches | Identification is a necessary condition for many kinds of preservation measures: the content of a web archive must be well known in order to plan and execute preservation actions in a sensible way. Several software candidates have already been proposed in this context, for example DROID, FIDO, and FITS (see the identification sketch below).

There is a clear requirement for identifiers (URIs) on a fine-grained level (formats), on a coarse-grained level (mime-types), and even on the level of properties. Identification of compound objects is important because many formats involve container formats and nested container objects. For example, a container format like AVI or MP4 describes how data and metadata are stored; such a video container holds a video track along with one or more audio tracks, and these embedded files can in turn carry their own metadata (e.g. ID3 tags for MP3). For these compound objects, there is a requirement to make the depth of analysis configurable, e.g. in order to decide whether compressed files and other containers should be opened and deeply analysed or not. Furthermore, it is sometimes not sufficient to identify a file as a container format: a ZIP container could more specifically be an ODF file (the OpenDocument format is simply a ZIP file with a defined internal structure). In summary, the format must be determined at a finer level of granularity, trying to capture the actual nature of the digital item, not only the type of container format (see the container sketch below).

One option would be to publish the results of characterisation as RDF and create a Linked Open Data cloud that contains the format identification results. Existing vocabularies, e.g. SKOS, could then be used to describe dependencies between formats. The open question here is how coarse-granular identifications (e.g. mime-type image/tiff) can be compared with fine-granular ones (e.g. PRONOM-ID fmt/7); the RDF sketch below holds both side by side. It was decided that no manual editing of registry entries should be required; instead, the registry should be populated automatically with characterisation results. It should, however, remain possible to change the registry manually.

Although identification has been identified as the first necessary step, it must be seen in the wider context of laying the ground for preservation planning, migration, and quality assurance. Regarding the migration of content, on-the-fly migration of obsolete files would be one possibility for providing access to such content.

EXL: It would be interesting to see how well DROID 6 handles different container formats. We have experienced issues with the ARC format. The question here seems to be how to do format identification on containers. Once format identification is done, we need to extract and store the significant properties of each file within the container, so that risk reporting and remediation can be done at a later time.

KEEPS:
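Illustrative sketches of the approaches above follow. First, invoking one of the proposed identification tools: a minimal sketch assuming FIDO is installed and available on the PATH as the `fido` command. FIDO emits one CSV row per match, but the exact column layout varies between versions, so the parsing below is an assumption rather than a fixed contract.

```python
import csv
import subprocess

def identify_with_fido(path: str) -> list[list[str]]:
    """Run FIDO on a single file and return its raw CSV match rows."""
    completed = subprocess.run(
        ["fido", path], capture_output=True, text=True
    )
    # Each non-empty row starts with a status field ("OK"/"KO") and contains,
    # among other fields, the PRONOM PUID of the matched format.
    return [row for row in csv.reader(completed.stdout.splitlines()) if row]
```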
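For the configurable depth of analysis and the ODF case discussed above, a minimal sketch using only the Python standard library; the function name and result structure are illustrative, not an agreed interface.

```python
import io
import zipfile

def identify_container(data: bytes, depth: int = 1) -> dict:
    """Coarsely identify ZIP-based containers, descending `depth` levels.

    depth=0 stops at the container level; depth=1 opens one layer of
    nesting, and so on.
    """
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return {"format": "not-a-zip"}
    result = {"format": "application/zip", "children": []}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # An ODF file is a ZIP containing an entry named "mimetype" that
        # holds the precise format, e.g. application/vnd.oasis.opendocument.text.
        if "mimetype" in zf.namelist():
            result["format"] = zf.read("mimetype").decode("ascii", "replace").strip()
        if depth > 0:
            for name in zf.namelist():
                if name.endswith("/"):  # skip directory entries
                    continue
                child = identify_container(zf.read(name), depth - 1)
                result["children"].append({"name": name, **child})
    return result
```

Real identification would of course consult format signatures (DROID/FIDO) rather than the mimetype entry alone, but the sketch shows how a depth parameter bounds the cost of deep analysis.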
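Finally, a hedged sketch of publishing identification results as Linked Open Data with rdflib, holding the coarse-grained (mime-type) and fine-grained (PRONOM-ID) identification of the same object side by side so the two levels can be compared. The `ex:` vocabulary and the record URI are hypothetical placeholders, not an agreed ontology.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/id-results/")  # hypothetical vocabulary
PRONOM = Namespace("http://www.nationalarchives.gov.uk/pronom/")

g = Graph()
g.bind("ex", EX)

record = EX["warc-record-42"]  # hypothetical object URI
g.add((record, EX.coarseFormat, Literal("image/tiff")))  # mime-type level
g.add((record, EX.fineFormat, PRONOM["fmt/7"]))          # PRONOM-ID level

# Serialize for publication as Linked Open Data (returns a str in rdflib >= 6).
print(g.serialize(format="turtle"))
```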
|
Context | |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices) |
Training Needs | Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP. |
Datasets | Web archive data sets from KB, SB (to be confirmed), ONB, IM. |
Solutions |
Evaluation
Objectives | Automation, scalability, characterisation |
Success criteria | The technical implementation will make it possible to characterise the content of a web archive within an adequate time frame and with a fair level of correctness. |
Automatic measures | 1. Process 100,000 objects per node per hour (on an average 2.5 GHz CPU) 2. Identify 95% of the objects correctly (see the measurement sketch below) |
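A minimal sketch of how the two automatic measures could be computed, assuming an `identify` callable and a ground-truth mapping are available; all names here are illustrative, not part of an agreed interface.

```python
import time

def evaluate(identify, objects: dict, ground_truth: dict) -> dict:
    """Measure throughput (objects/hour) and identification accuracy.

    `objects` maps object IDs to file contents; `ground_truth` maps object
    IDs to the expected format IDs.
    """
    start = time.monotonic()
    results = {obj_id: identify(data) for obj_id, data in objects.items()}
    elapsed = time.monotonic() - start
    correct = sum(
        1 for obj_id, fmt in results.items() if ground_truth.get(obj_id) == fmt
    )
    return {
        "objects_per_hour": len(objects) / elapsed * 3600,  # target: 100,000
        "accuracy": correct / len(objects),                 # target: 0.95
    }
```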
Manual assessment | As soon as the technical implementation has been set up: process web archives fully unattended. The steps between loading the web archives into the system and receiving the characterisation results must require no intervention from the end-user of the workflow. |
Actual evaluations | links to actual evaluations of this Issue/Scenario |