Web archives may contain everything that can be found on the web, except for files or file types that are explicitly excluded from harvesting. This means that a web archive usually includes a vast number of different file types.
On the one hand, the major part of the content consists of the typical file formats for text, image, audio, and video content, such as HTML, JPEG, PNG, MP3, MPEG, etc. On the other hand, as any kind of file can be offered for download simply by linking to it, there will also be files that are not supposed to be displayed or played by a typical web browser environment.
From a curatorial perspective the questions are: Do I need to be worried? Is there a risk that calls for adequate measures right now? Is there any content that may become very expensive or even impossible to access?
The first step is therefore to reliably identify and characterise the content of a web archive. The particular challenge here is that fine-grained knowledge is required. For example, it is not sufficient to know that we are dealing with objects of the MIME type “text/plain” or “application/pdf”.
Linguistic analysis can help in categorising the “text/plain” content into more precise content types. And a detailed analysis of “application/pdf” content can help to cluster properties of the files and identify characteristics that are of special interest. All of this can help us in planning any actions that need to be taken. And we are curious to hear your ideas about what else we can do!
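As one concrete way to go beyond a reported MIME type, an identification step might sniff the leading magic bytes of a payload before any deeper analysis. The sketch below is illustrative only; the signature table and the function name are assumptions, not part of an existing tool:

```python
# Minimal sketch: fine-grained format identification from leading magic
# bytes, independent of whatever MIME type the server reported.
# The signature list here is a small illustrative subset.

MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff_format(payload: bytes) -> str:
    """Return a format guess from the leading bytes, or 'unknown'."""
    for signature, fmt in MAGIC_SIGNATURES:
        if payload.startswith(signature):
            return fmt
    return "unknown"
```

A mismatch between the sniffed format and the recorded MIME type is exactly the kind of characteristic that deserves curatorial attention.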
How can Hadoop help us here? Using the Hadoop framework and prepared sample projects for processing web archive content, we will be able to perform any kind of processing or analysis that we come up with on a large scale using a Hadoop cluster. Together we will discuss the requirements to enable this and find out what still needs to be optimised.
After local development and testing on a small data set, the Hadoop job will later be executed on a Hadoop cluster using a representative real-world web archive data sample.
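To make the Hadoop angle concrete, a word-count-style job could, for example, profile the MIME types found in an archive. The following is a minimal sketch in the spirit of Hadoop Streaming; the tab-separated "url<TAB>mime_type" record format is an assumption for illustration:

```python
# Sketch of a Hadoop-Streaming-style job that counts records per MIME
# type. The mapper reads tab-separated lines of the (assumed) form
# "<url>\t<mime_type>" and emits (mime_type, 1) pairs; the reducer sums
# the counts. Hadoop delivers mapper output sorted by key, which is why
# a simple groupby suffices in the reducer.

from itertools import groupby

def mapper(lines):
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            yield parts[1], 1

def reducer(pairs):
    for mime, group in groupby(pairs, key=lambda kv: kv[0]):
        yield mime, sum(count for _, count in group)
```

Locally this can be tested by feeding a handful of lines through `mapper`, sorting, and piping the result into `reducer`; on the cluster the same logic runs over millions of records.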
Many large-scale book digitisation projects have been carried out in institutions and organisations worldwide, enabling web-based access to content that was previously available only locally in analogue form.
Because of the vast number of digital objects produced in these projects, fully or partly automated data processing becomes an essential part of digital collection management and quality assurance.
The digital objects of the Austrian National Library's digital book collection consist basically of the aggregated book object with technical and descriptive metadata, and of the images, layout, and text content for the book pages. Due to the massive scale of digitisation in a relatively short time period, and the fact that the digitised books are from the 18th century and older, there are different types of quality issues.
A unified quality measure for the digital book as a whole, or for individual pages of the book (image, text, and layout), is essential to determine whether any actions need to be taken in order to preserve the digital object and guarantee access in the long term.
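One way such a unified measure could be assembled is to combine per-page scores for image, text, and layout into a book-level figure. The weights and the choice to report the worst page alongside the mean are purely illustrative assumptions:

```python
# Hypothetical sketch: combining per-page quality scores (image, text,
# layout, each assumed to be in [0, 1]) into a single book-level measure.
# The weights below are illustrative choices, not a validated model.

def page_score(image: float, text: float, layout: float) -> float:
    weights = {"image": 0.4, "text": 0.4, "layout": 0.2}
    return weights["image"] * image + weights["text"] * text + weights["layout"] * layout

def book_score(pages):
    """Return (mean score, worst page score): a single bad page may
    already call for curatorial action, so the minimum matters too."""
    scores = [page_score(*p) for p in pages]
    return sum(scores) / len(scores), min(scores)
```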
Using the Hadoop framework, we provide the means to perform any kind of large-scale book processing on a book or page level. Linguistic analysis and language detection, for example, can help us determine the quality of the OCR (Optical Character Recognition), and image analysis can help in detecting any technical or content-related issues with the book page images.
After local development on a small data set, the Hadoop job will then be executed on a Hadoop cluster using a real-world data sample consisting of thousands of digital books.
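As a taste of what such per-page text analysis could look like, a very simple OCR-quality heuristic is the fraction of tokens that appear in a reference word list. The tiny lexicon below is an illustrative stand-in; for 18th-century books one would need a historical-language dictionary:

```python
# Sketch of a simple OCR-quality heuristic: the share of alphabetic
# tokens found in a reference lexicon. Heavily garbled OCR output
# (e.g. "qu1ck" for "quick") produces tokens that miss the lexicon,
# pulling the score down.

import re

def ocr_quality(text: str, lexicon: set) -> float:
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)
```

Run per page inside a Hadoop job, such a score could feed directly into the unified quality measure discussed above.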