Identifying web content

compared with
Current by Paul Wheatley
on May 09, 2012 16:35.

| *Title* \\ | Identifying web content |
| *Detailed description* | The Web archives team at BnF has long suspected that the MIME types declared by servers for the content of archived pages are not always accurate. This is a problem for preservation and could, for instance, impede future emulation. Thus, when expanding BnF's preservation system (SPAR) to ingest our Web archives collection, we had an ARC module developed for JHOVE2 ([http://bitbucket.org/lbihanic/jhove2-bnf|https://bitbucket.org/lbihanic/jhove2-bnf]; this fork will be integrated into the general release of JHOVE2 in the coming year). This module produces a report on the characteristics of the ARC files, including the declared MIME types of the content files and an identification of those same files by the FILE utility. When comparing the two during the initial tests of the Web archives ingest process, we realized that the results differed, especially where scripts and software were concerned. We would like to be able to evaluate the accuracy of those reports and correctly identify the content of our Web archives. The problem is compounded by the huge size of the collection and the vast array of file formats present. |
| *Issue champion* | [~lfauduet]\\ |
| *Other interested parties* \\ | _Any other parties who are also interested in applying Issue Solutions to their Datasets_ |
| *Possible Solution approaches* | _Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list_ \\ |
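
As a rough illustration of the comparison described in the detailed description, the Python sketch below tallies the cases where the MIME type declared by the server and the type identified from the file content disagree. The record format, the function names and the sample values are all assumptions made for illustration; they do not reflect the actual structure of the JHOVE2 report or the output of the FILE utility.

```python
from collections import Counter


def normalize_mime(value):
    """Lowercase a MIME type and drop parameters such as '; charset=utf-8'."""
    return value.split(";", 1)[0].strip().lower()


def tally_mismatches(records):
    """Count (declared, identified) pairs that disagree.

    `records` is an iterable of (declared, identified) strings. This input
    shape is hypothetical, not the format of the JHOVE2 report.
    """
    mismatches = Counter()
    for declared, identified in records:
        d, i = normalize_mime(declared), normalize_mime(identified)
        if d != i:
            mismatches[(d, i)] += 1
    return mismatches


# Made-up sample records, for illustration only.
sample = [
    ("text/html; charset=UTF-8", "text/html"),
    ("text/plain", "application/javascript"),
    ("text/plain", "application/javascript"),
    ("image/jpeg", "image/png"),
]

# List the disagreeing pairs, most frequent first.
for (declared, identified), count in tally_mismatches(sample).most_common():
    print(f"{declared} -> {identified}: {count}")
```

Grouping by pair rather than by file keeps the output small even for a very large collection, and shows which declared types are most often contradicted by identification.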