| *Title* \\ | Identifying web content |
| *Detailed description* | The Web archives team at BnF has long suspected that the MIME types declared by servers for the content of archived pages are often inaccurate, which is a problem for preservation and could, for instance, impede future emulation. Thus, when expanding BnF's preservation system (SPAR) to ingest our Web archives collection, we had an ARC module developed for JHOVE2 ([http://bitbucket.org/lbihanic/jhove2-bnf|https://bitbucket.org/lbihanic/jhove2-bnf]; this fork will be integrated into the general release of JHOVE2 in the coming year). This module produces a report of the characteristics of the ARC files, including the declared MIME types of the content files and an identification of those same files using the FILE utility. When comparing the two during the initial tests of the web archives ingest process, we found that the results differed, especially for scripts and software. We would like to be able to evaluate the accuracy of those reports and correctly identify the content of our web archives. The problem is compounded by the huge size of the collection and the vast array of file formats present. |
| *Issue champion* | [~lfauduet]\\ |
| *Other interested parties* \\ | _Any other parties who are also interested in applying Issue Solutions to their Datasets_ |
| *Possible Solution approaches* | _Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list_ \\ |
| *Context* | [Bibliothèque nationale de France (National Library of France, BnF)|REQ:Bibliothèque nationale de France (National Library of France, BnF)]\\ |
| *Lessons Learned* | _Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_ \\ |
| *Datasets* | [REQ:French Web Archives]\\ |
| *Solutions* | [REQ:Server MIME Type Correction]\\ |
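
The comparison described under *Detailed description* (server-declared MIME type versus the type reported by the FILE utility) could be tallied roughly as follows. This is only a minimal sketch and not the JHOVE2 ARC module: the function names {{normalize_mime}}, {{identify_with_file}} and {{tally_mismatches}} are hypothetical, the sample pairs are invented for illustration, and it assumes the {{file}} command is available on the PATH.

```python
import subprocess
from collections import Counter


def normalize_mime(value):
    """Lowercase a MIME type and drop parameters such as '; charset=utf-8'."""
    return value.split(";", 1)[0].strip().lower()


def identify_with_file(payload):
    """Ask the `file` utility for the MIME type of a byte string.

    Assumes `file` is installed; the payload is passed on standard input.
    """
    result = subprocess.run(
        ["file", "--brief", "--mime-type", "-"],
        input=payload,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("ascii", "replace").strip()


def tally_mismatches(records):
    """Count (declared, identified) pairs that disagree after normalization.

    `records` is an iterable of (declared_mime, identified_mime) tuples,
    for example extracted from a characterization report.
    """
    mismatches = Counter()
    for declared, identified in records:
        d, i = normalize_mime(declared), normalize_mime(identified)
        if d != i:
            mismatches[(d, i)] += 1
    return mismatches


if __name__ == "__main__":
    # Invented sample pairs, for illustration only.
    sample = [
        ("text/html; charset=UTF-8", "text/html"),
        ("text/plain", "application/javascript"),
        ("application/octet-stream", "application/x-shockwave-flash"),
    ]
    for (declared, identified), count in tally_mismatches(sample).most_common():
        print(f"{count}\t{declared}\t{identified}")
```

Grouping the disagreements by (declared, identified) pair is one way to see which kinds of content are most often mislabelled before deciding which identification to trust; it does not by itself say which of the two values is correct.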