Rune Ferneke-Nielsen (SB)
The idea behind this experiment is that you have a digital newspaper collection, in JPEG 2000 format, and you want to verify that certain properties hold true for every file in the collection. The properties that should hold true are specified in a control policy, which at a minimum contains information about the digital newspaper collection.
The first iteration of the experiment will use a very simple setup and focus on processing the files using jpylyzer - we would like to get a first indication of the performance without bringing extra complexity into the equation. Therefore, files will be read from local storage instead of from our repositories, as would normally be the case. Moreover, output from the processing will be discarded - failing processes being the exception - instead of being stored in our repositories. The Hadoop configuration has not been altered, apart from the necessary settings that correspond to our cluster.
The first step will extract metadata from each file in the newspaper collection using jpylyzer. The second step will compare the extracted metadata against the control policy and report any differences.
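The comparison in the second step can be sketched as follows. This is a minimal illustration that assumes the control policy has been reduced to a flat dict of expected property values; the property names are made up for the example.

```python
def validate_properties(extracted: dict, policy: dict) -> list:
    """Compare extracted file properties against the control policy.

    Returns a list of (property, expected, actual) tuples for every
    property that deviates from the policy; a missing property is
    reported with an actual value of None.
    """
    differences = []
    for prop, expected in policy.items():
        actual = extracted.get(prop)
        if actual != expected:
            differences.append((prop, expected, actual))
    return differences

# Example: one conforming and one deviating file (illustrative values)
policy = {"valid": True, "wellFormed": True}
ok_file = {"valid": True, "wellFormed": True}
bad_file = {"valid": False, "wellFormed": True}

print(validate_properties(extracted=ok_file, policy=policy))   # []
print(validate_properties(extracted=bad_file, policy=policy))  # [('valid', True, False)]
```

In the real experiment, the extracted dict would be populated from jpylyzer output and the policy would come from the machine-readable control policy document.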
- Where are the newspaper collection files stored - in a repository, on local storage (outside of Hadoop), or in HDFS?
- What is the appropriate number of concurrently running tasks? One way to handle this is by specifying the split size, which determines how many map tasks to start. Moreover, jpylyzer is able to handle several input paths, which provides a second level of control over concurrency. The hardware specification should of course be taken into consideration in this discussion (available nodes, CPU cores and threads, memory, etc.).
- Should the generated metadata be stored, and if so, where - in a repository, on local storage, in HDFS - or should it be discarded?
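The relationship between split size and map-task count mentioned above can be illustrated roughly as follows; the numbers are illustrative, and actual task counts also depend on file layout and the InputFormat in use.

```python
import math

def estimated_map_tasks(total_input_bytes: int, split_size_bytes: int) -> int:
    """Rough estimate of how many map tasks Hadoop will start:
    one task per input split."""
    return math.ceil(total_input_bytes / split_size_bytes)

# e.g. 500 GB of JP2 files with a 128 MB split size
print(estimated_map_tasks(500 * 1024**3, 128 * 1024**2))  # 4000
```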
Improvements & suggestions
- configure the 'fair scheduler' for the Hadoop cluster, thereby controlling the number of simultaneously running (map) tasks.
- as of February 2014, this experiment is missing a component/tool for converting tool-specific output into SCAPE-generic output - this has been mocked for the experiment.
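A sketch of what the missing conversion component might do, assuming jpylyzer's XML output shape (an isValidJP2 element plus a properties subtree) and using a flat key-value dict as a stand-in for the SCAPE-generic format:

```python
import xml.etree.ElementTree as ET

def jpylyzer_to_generic(xml_text: str) -> dict:
    """Convert jpylyzer's tool-specific XML output into a flat
    key-value dict (a stand-in for the SCAPE-generic format)."""
    root = ET.fromstring(xml_text)
    generic = {}
    valid = root.findtext("isValidJP2")
    if valid is not None:
        generic["valid"] = valid.strip() == "True"

    # flatten the <properties> subtree into dotted keys
    def walk(node, prefix):
        for child in node:
            key = f"{prefix}{child.tag}"
            if len(child):
                walk(child, key + ".")
            elif child.text and child.text.strip():
                generic[key] = child.text.strip()

    props = root.find("properties")
    if props is not None:
        walk(props, "")
    return generic

# Abbreviated, made-up sample of jpylyzer-style output
sample = """<jpylyzer>
  <isValidJP2>True</isValidJP2>
  <properties><jp2HeaderBox><imageHeaderBox>
    <height>2048</height><width>1536</width>
  </imageHeaderBox></jp2HeaderBox></properties>
</jpylyzer>"""
print(jpylyzer_to_generic(sample))
```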
Building upon the results from the first iteration, we want the experiment to reflect our reality more closely. We extend the setup further by adding repositories, from which data will be read and to which it will be written. In detail, we will use a Fedora-based repository for reading and writing content metadata, and a bit repository for reading content. By adding these systems, we need to extend the experiment with components that can load and store data in an efficient manner.
The environment has been extended to also include the two repositories, containing the metadata and content of the images.
- Step 1: Extracting metadata from the Fedora-based repository
- Step 2: Performing quality assurance on the Hadoop platform
- Step 3: Storing metadata into the Fedora-based repository
Step 1 can be split into:
- provide list of IDs for objects to extract from repository
- transform objects into METS documents; sample
- store METS documents in a Hadoop sequence file
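Real SequenceFiles are typically written with the Hadoop Java API; as a language-neutral sketch of the staging idea, the step amounts to producing one (object ID, METS document) record per object, with the XML collapsed onto a single line. The fetch function and IDs below are illustrative.

```python
def stage_records(object_ids, fetch_mets):
    """Produce (object ID, METS document) key-value records, ready to
    be written to a SequenceFile-like container; newlines in the XML
    are escaped so each record fits on one line."""
    records = []
    for obj_id in object_ids:
        mets_xml = fetch_mets(obj_id)  # e.g. an export from the repository
        records.append((obj_id, mets_xml.replace("\n", "&#10;")))
    return records

# Illustrative stand-in for the repository export
fake_fetch = lambda oid: f"<mets:mets>\n  <!-- {oid} -->\n</mets:mets>"
for key, value in stage_records(["obj:1", "obj:2"], fake_fetch):
    print(key)
```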
Step 2 can be split into:
- read sequence file, and for each METS document
- get file reference to locate image file on NFS mount, can be found under <mets:file> node
- validate image using jpylyzer and control policy
- write image metadata and validation result into METS document
- store METS documents in a sequence file
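The file-reference lookup in the second sub-step can be sketched like this, assuming the path is carried in an xlink:href attribute on a <mets:FLocat> child of <mets:file>, which is the usual METS convention; the sample document is made up.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/",
           "xlink": "http://www.w3.org/1999/xlink"}

def image_path_from_mets(mets_xml: str) -> str:
    """Locate the image file reference under the <mets:file> node,
    reading the xlink:href attribute of its <mets:FLocat> child."""
    root = ET.fromstring(mets_xml)
    flocat = root.find(".//mets:file/mets:FLocat", METS_NS)
    return flocat.get("{http://www.w3.org/1999/xlink}href")

sample = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec><mets:fileGrp>
    <mets:file ID="f1">
      <mets:FLocat xlink:href="file:///mnt/nfs/batch1/page0001.jp2"/>
    </mets:file>
  </mets:fileGrp></mets:fileSec>
</mets:mets>"""
print(image_path_from_mets(sample))  # file:///mnt/nfs/batch1/page0001.jp2
```

The resolved path would then be handed to jpylyzer, and the validation result written back into the same METS document.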
Step 3 can be split into:
- read updated METS documents from sequence file, and for each METS document
- update corresponding repository object with changes from METS document
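A sketch of the update call in the third step, assuming a REST-style interface in front of the Fedora-based repository; the endpoint layout and object ID below are hypothetical, not the actual connector API.

```python
def build_update_request(base_url: str, object_id: str, mets_xml: str) -> dict:
    """Prepare an HTTP PUT that replaces a repository object's metadata
    with the updated METS document; actually sending the request is
    left to an HTTP client. The URL scheme is a made-up example."""
    return {
        "method": "PUT",
        "url": f"{base_url}/entity/{object_id}/metadata",
        "headers": {"Content-Type": "application/xml"},
        "body": mets_xml.encode("utf-8"),
    }

req = build_update_request("http://repo.example.org/connector",
                           "newspaper:1234", "<mets:mets/>")
print(req["method"], req["url"])
```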
The SCAPE Stager & Loader components
Extracting and storing data in the repository is handled by the SCAPE Stager and Loader components, which both interact with the repository through a SCAPE Data Connector.
Policy statements that relate to this experiment and any evaluation criteria taken from SCAPE metrics