The goal of this session is to modify an existing preservation workflow so that it can be executed in a scalable environment. To do this, we will use Taverna, Hadoop, and a MapReduce application for wrapping command-line tools.
In the following, we use a workflow that automatically converts TIFF images to JP2 (JPEG 2000) using OpenJPEG and then validates the format of the result using Jpylyzer. Open a terminal window and perform this task on the command line against a single test file. A script that applies this task to multiple image files can be found in the folder Tiff2JP2-Example. Use the script to experiment with sets of sample images.
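The single-file task can be sketched as a small shell function. This is a minimal sketch and not the script shipped in Tiff2JP2-Example: it assumes that the OpenJPEG encoder is installed as opj_compress (older OpenJPEG releases name it image_to_j2k) and that jpylyzer is on the PATH. The function names jp2_name and tiff2jp2 and the OPJ_ENCODER variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: convert one TIFF to JP2 and validate the result.
# Assumptions: the encoder binary is opj_compress (override with OPJ_ENCODER,
# e.g. OPJ_ENCODER=image_to_j2k) and jpylyzer is installed.

# Derive the JP2 file name from the TIFF file name (sample.tif -> sample.jp2).
jp2_name() {
    printf '%s.jp2\n' "${1%.*}"
}

# Convert the TIFF given as $1, then write the Jpylyzer XML report
# next to the new JP2 file (sample.jp2.xml).
tiff2jp2() {
    src="$1"
    dst=$(jp2_name "$src")
    "${OPJ_ENCODER:-opj_compress}" -i "$src" -o "$dst" &&
        jpylyzer "$dst" > "$dst.xml"
}

# Usage (after sourcing this file):
#   tiff2jp2 sample.tif
```

Jpylyzer prints its report as XML on standard output, so redirecting it to a file keeps one validation report per converted image.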
The task from the previous exercise can be easily implemented, visualized, and run as a Taverna workflow. The Tiff2JP2-Example directory contains a workflow called
which implements this scenario. The workflow takes a reference to a directory as input and executes the tool chain for every TIFF file contained in that folder. Open the workflow in Taverna and investigate how Taverna's tool invocation mechanism is used to execute the preservation tools. Also examine the remaining activities, which select the files and generate the proper command-line statements. Finally, run the workflow against a test data set and examine the results.
In this section we will use a MapReduce application to execute OpenJPEG against a set of images on Hadoop.
The SCAPE tool-wrapper is used to execute (preservation) command-line tools on a Hadoop cluster. Hadoop input data resides on HDFS, which conventional command-line tools typically cannot access directly, so the wrapper handles the communication between the Hadoop runtime environment and the tools to be executed. You can find the tool-wrapper within the Tiff2JP2-Example folder as a Java archive (.jar file).
A set of tool specification documents is provided within the toolspecs folder. These files describe different command-line patterns that tell the tool-wrapper how a particular tool should be invoked (e.g. with which parameters). Take a look at the tool specification for OpenJPEG, which we will use in the subsequent steps.
Now copy the entire toolspecs folder to your home directory on HDFS using the put command.
Verify that the toolspecs are available on HDFS using the -ls and -cat commands.
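The copy and verification steps can be sketched with the standard hadoop fs shell. This is a minimal sketch, assuming that you run it from the directory that contains the local toolspecs folder and that no toolspecs folder exists yet in your HDFS home directory. The helper names run, publish_toolspecs and show_toolspec are illustrative, and the DRY_RUN switch is only there so the commands can be previewed without a cluster.

```shell
#!/bin/sh
# Sketch only: copy the toolspecs folder to HDFS and check the result.

# Print the command instead of running it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the local toolspecs folder to the HDFS home directory and list it.
# A relative HDFS path is resolved against /user/<your-user-name>.
publish_toolspecs() {
    run hadoop fs -put toolspecs toolspecs &&
        run hadoop fs -ls toolspecs
}

# Print one tool specification file stored on HDFS.
show_toolspec() {
    run hadoop fs -cat "toolspecs/$1"
}

# Usage (after sourcing this file):
#   publish_toolspecs
#   show_toolspec <name-of-a-toolspec-file>
```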
The tool-wrapper also requires an input file that specifies the parameters for the tool invocations, i.e. the locations of the input and output files. When the Taverna workflow is used, this file is generated within the workflow. However, if the tool-wrapper is used from the command line, the input file has to be created manually.
Use the script
to create an input file for the TIFF files you would like to migrate to JP2. For example type:
to generate an input file for all .tiff images in the files folder.
Finally, we have to make the input files available on the distributed file system. Copy the folder with the input files you specified in the previous step to your home directory on HDFS. Make sure that the references used in the file input-files.txt correspond to the location of your input files on HDFS.
Now we are ready to start the tool-wrapper. Use the following command to start the MapReduce application:
(The last parameter sets the location of the toolspecs on HDFS.) Monitor the execution of your application using the web interfaces and log files. The JP2 files will be written to the input directory. Output written by MapReduce will be put into a folder called out by default.
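Once the job has finished, its output can also be inspected from the terminal. This is a minimal sketch, assuming the default output folder out in your HDFS home directory and Hadoop's usual part-* naming for MapReduce output files. The helper names run and show_job_output are illustrative, and DRY_RUN only previews the commands.

```shell
#!/bin/sh
# Sketch only: list and print the MapReduce output written to HDFS.

# Print the command instead of running it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the output folder (default: out) and print its part files.
# The glob is quoted so that hadoop, not the local shell, expands it.
show_job_output() {
    out_dir="${1:-out}"
    run hadoop fs -ls "$out_dir" &&
        run hadoop fs -cat "$out_dir/part-*"
}

# Usage (after sourcing this file):
#   show_job_output          # inspects the default folder "out"
#   show_job_output myout    # inspects another output folder
```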
In this session we will use a Taverna workflow that implements the Tiff2JP2 workflow by using Hadoop as the execution environment. The workflow
is available in the Tiff2JP2-Example directory and can be loaded into the Taverna workbench just like any other workflow.
Take a look at the tool invocation activities and compare them with the previously executed workflow, which uses local tool invocations. Also note the Location tab of the tool configuration dialog, which is presently set to "default local". This configuration is fine for the virtual machine setup we are using, but accessing a remote cluster would require adding a new location. Use the Manage locations button to find out how remote locations can be added. The password of Taverna's Credential Manager is: master.
Let's start the workflow on the Hadoop cluster that is running on your local virtual machine! Ensure that the folder containing the TIFF images to be processed by this workflow is available in your HDFS home directory. The workflow exposes two input ports: (1) for hadoopjobjar use "Set file location" and point Taverna to the toolwrapper.jar; (2) for inputdir use "Set value" and provide the path to your input files on HDFS, e.g. /user/bob/myTiffs.
Before you run the workflow, make sure the MapReduce Administration Interface is open in a browser window. Finally, execute the workflow and monitor the execution of the different MapReduce jobs as well as the results they create.
If there is still time left, take a look at the following directory:
The folder contains a more complex version of the TIFF to JP2 migration workflow, which, for example, also uses the SCAPE Matchbox tool to visually compare the source and target files. This workflow can also be run on your virtual machine.