Identifying web content

Title Identifying web content
Detailed description The Web archives team at BnF has long suspected that the MIME types declared by servers for their pages' content are not accurate. This is a problem for preservation and could, for instance, impede future emulation. Thus, when expanding BnF's preservation system (SPAR) to ingest our Web archives collection, we had an ARC module developed for JHOVE2 (http://bitbucket.org/lbihanic/jhove2-bnf; this fork will be integrated into the general release of JHOVE2 in the coming year). This module produces a report on the characteristics of the ARC files, including the declared MIME types of the content files and an identification of those same files using the FILE utility. When comparing the two during the initial tests of the web archives ingest process, we found that the results differed, especially where scripts and software were concerned. We would like to be able to evaluate the accuracy of those reports and correctly identify the content of our web archives. The problem is compounded by the huge size of the collection and the vast array of file formats present.
Issue champion Louise Fauduet
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets
Possible Solution approaches Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list
Context Bibliothèque nationale de France (National Library of France, BnF)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets French Web Archives
Solutions Server MIME Type Correction
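The comparison described in the detailed description (server-declared MIME type versus the type identified by the FILE utility) could be tallied along the following lines. This is only a minimal sketch, not the JHOVE2 ARC module: the function names, the record layout (URL, declared type, detected type) and the sample records are assumptions made up for illustration, and a real report would have to be parsed into this shape first.

```python
from collections import Counter


def normalize(mime):
    """Lower-case a MIME type and drop parameters such as '; charset=UTF-8'."""
    return mime.split(";", 1)[0].strip().lower()


def tally_mismatches(records):
    """Count (declared, detected) pairs that disagree.

    records: iterable of (url, declared_mime, detected_mime) tuples.
    This layout is hypothetical, not the actual report format.
    """
    mismatches = Counter()
    for _url, declared, detected in records:
        d, f = normalize(declared), normalize(detected)
        if d != f:
            mismatches[(d, f)] += 1
    return mismatches


# Hypothetical sample records, invented for illustration only.
sample = [
    ("http://example.org/", "text/html; charset=UTF-8", "text/html"),
    ("http://example.org/app.js", "text/plain", "application/javascript"),
    ("http://example.org/logo", "application/octet-stream", "image/png"),
]

for (declared, detected), count in tally_mismatches(sample).items():
    print(f"{declared} -> {detected}: {count}")
```

Grouping by (declared, detected) pair rather than by file keeps the output small even for a very large collection, and shows which kinds of disagreement (for example scripts served as plain text) dominate.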
Labels: issue, york_hackathon, identification, unknown_characteristics