h1. PPL

PPL (short for “Program for parallel Preservation Load”) is a tool for the optimized execution of parallel algorithms for digital preservation actions.
It takes a [Taverna|PT.WP.4.MS38 Basic Taverna Workbench available for testbed use] workflow as an input and automatically generates all necessary classes to run that workflow as a series of Hadoop jobs on a cluster.
The input for the Hadoop job must be formatted so that each line starts with its line number, followed by a separator (e.g. a tab), before the actual data.
Without this numbering, dot products or cross products of the inputs could not be computed efficiently.
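As a minimal sketch (the records below are invented purely for illustration), a numbered, tab-separated plain-text input could look like this:

{noformat}
0	The quick brown fox
1	jumps over
2	the lazy dog
{noformat}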
It is strongly recommended to use [sequence files|http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html] as input, as this improves performance significantly (see also [SequenceFileInputFormat|http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html]).
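The following is a minimal sketch of how such a sequence file could be created from a plain-text input; the choice of {{LongWritable}} keys (the line number) and {{Text}} values, as well as the class name, are assumptions for illustration and not part of PPL itself:

{code:java}
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextToSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path(args[1]);

        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(output),
                     SequenceFile.Writer.keyClass(LongWritable.class),
                     SequenceFile.Writer.valueClass(Text.class))) {

            String line;
            long lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                // The line number becomes the key, the record itself the value.
                writer.append(new LongWritable(lineNumber++), new Text(line));
            }
        }
    }
}
{code}

Run with a local input file and an output path (e.g. on HDFS); the result is a sequence file whose keys are the line numbers and whose values are the original records.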

h2. Demo

[Example workflow download|http://www.user.tu-berlin.de/schenck/data/ppl_example.t2flow]

!workflow.png!

You can download and analyse the workflow above.
It is a set of Beanshell services with multiple input and output ports, connected in a non-linear fashion.
Each Beanshell performs a different [String|http://docs.oracle.com/javase/7/docs/api/java/lang/String.html] operation on the data.
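For readers unfamiliar with Taverna Beanshells: the body of such a service is a Java-like script in which each input port appears as a variable and each output port variable must be assigned. A hypothetical Beanshell with two inputs and two outputs (the port names here are invented and do not come from the example workflow) might look like this:

{code:java}
// Taverna binds each input port to a variable of the same name
// and expects every output port variable to be assigned.
String upper = input1.toUpperCase();            // a simple String operation
String combined = upper + "_" + input2;         // combine both inputs
output1 = combined;                             // first output port
output2 = String.valueOf(combined.length());    // second output port
{code}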
In the screencast, I will first introduce the workflow in Taverna, then show it running on Hadoop and finally talk about the template system.

[Link to screencast|http://www.user.tu-berlin.de/schenck/data/T2HCast.mp4]