
h3. Web-Archiving: File Format Identification/Characterisation
Web Archives may contain everything that can be found on the web, except for files or file types that are explicitly excluded from harvesting. This means that a Web Archive usually includes a vast number of different file types.
On the one hand, the major part of the content consists of the typical file formats for text, image, audio, and video content, like HTML, JPEG, PNG, MP3, MPEG, etc. On the other hand, as any kind of file can be offered for download just by linking to it, there will also be files that are not meant to be displayed or played in a typical web browser environment.
From a curatorial perspective the question is: Do I need to be worried? Is there a risk that requires me to take measures right now? Is there any content which may become very expensive or even impossible to access?
The first step is therefore to reliably identify and characterise the content of a web archive. The particular challenge here is that fine-grained knowledge is required. For example, it is not sufficient to know that we are dealing with objects of the MIME type “text/plain” or “application/pdf”.
Linguistic analysis can help in categorising the “text/plain” content into more precise content types. And a detailed analysis of “application/pdf” content can help to cluster the files by their properties and to identify characteristics that are of special interest. This can all help us in planning any actions that need to be taken. And we are curious to hear your ideas about what else we can do\!
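To illustrate what going beyond the MIME type can look like, here is a minimal sketch in Python. It checks a payload against a few well-known magic numbers and, for PDF, reads the version from the file header. The signature table, the function name {{identify}} and the result layout are assumptions made for this example only; a real workflow would rely on a dedicated identification tool with a far larger signature set, and real PDF files do not always start exactly at the first byte.

```python
import re

# Hypothetical, deliberately tiny signature table (assumption for this sketch).
# Each entry maps the leading bytes of a file to a coarse MIME type.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]

# A PDF header has the form "%PDF-x.y", e.g. "%PDF-1.4".
PDF_VERSION = re.compile(rb"%PDF-(\d\.\d)")


def identify(payload: bytes) -> dict:
    """Return a coarse MIME type plus any finer detail readable from the header."""
    result = {"mime": "application/octet-stream", "detail": None}

    # Step 1: coarse identification by magic number.
    for magic, mime in MAGIC_SIGNATURES:
        if payload.startswith(magic):
            result["mime"] = mime
            break

    # Step 2: finer characterisation, here only the PDF version.
    if result["mime"] == "application/pdf":
        match = PDF_VERSION.match(payload)
        if match:
            result["detail"] = "PDF " + match.group(1).decode("ascii")

    return result
```

Calling {{identify}} on the first bytes of each archived payload gives a property (here the PDF version) that files can be clustered by, which is the kind of detail a bare “application/pdf” label hides.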
How can Hadoop help us here? Using the Hadoop framework and prepared sample projects for processing web archive content, we will be able to perform any kind of processing or analysis that we come up with on a large scale using a Hadoop cluster. Together we will discuss what the requirements are to enable this, and we will find out what still needs to be optimised.
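The processing model behind such a job can be sketched in a few lines of plain Python. This is not Hadoop code and not one of the prepared sample projects: it only imitates the map and reduce phases locally to show how a format profile of an archive could be computed. The record layout (URL, MIME type) and all function names are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (url, mime type): hypothetical record layout


def map_record(url: str, mime: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit one (format, 1) pair per archived record."""
    yield (mime.strip().lower(), 1)


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce phase: sum the counts emitted for each format."""
    totals: Dict[str, int] = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(records: Iterable[Record]) -> Dict[str, int]:
    """Run map and reduce in sequence; a cluster would distribute both phases."""
    pairs = (pair for url, mime in records for pair in map_record(url, mime))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = [
        ("http://example.org/", "text/html"),
        ("http://example.org/report.pdf", "application/pdf"),
        ("http://example.org/about", "Text/HTML "),
    ]
    print(run_job(sample))
```

On a cluster the map function would run in parallel over the archive files and the framework would group the emitted pairs by key before reducing. Swapping the MIME type for a finer-grained identification result turns the same job into a characterisation profile.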