
h1. PPL

PPL (short for “Program for parallel Preservation Load”) is a tool for the optimized execution of parallel algorithms for digital preservation actions.
It takes a [Taverna|PT.WP.4.MS38 Basic Taverna Workbench available for testbed use] workflow as an input and automatically generates all necessary classes to run that workflow as a series of Hadoop jobs on a cluster.
The input for the Hadoop job must be formatted so that each line starts with its line number, followed by a separator (i.e. a tab) and then the actual data.
Without these line numbers, dot products or cross products of the inputs could not be created efficiently.
Using [sequence files|] as input is strongly recommended, as it improves performance significantly (also see [here|]).
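
The line-numbered input format described above can be illustrated with a minimal sketch. It is not PPL's generated code: the helper names ({{number_lines}}, {{parse}}, {{dot_product}}, {{cross_product}}) are hypothetical, and the zero-based numbering is an assumption, since the page does not say where counting starts. The sketch shows why the leading line number helps: records can be paired by number even when they arrive in a different order, as can happen on a cluster.

```python
from itertools import product


def number_lines(lines, sep="\t"):
    """Prefix each record with its line number and a separator.

    Zero-based numbering is an assumption made for this sketch.
    """
    return [f"{i}{sep}{line}" for i, line in enumerate(lines)]


def parse(record, sep="\t"):
    """Split a numbered record into (line_number, data).

    Only the first separator is used, so the data may itself contain tabs.
    """
    number, data = record.split(sep, 1)
    return int(number), data


def dot_product(left, right):
    """Pair records that share a line number, regardless of arrival order."""
    right_by_number = dict(parse(r) for r in right)
    return [
        (data, right_by_number[n])
        for n, data in sorted(map(parse, left))
        if n in right_by_number
    ]


def cross_product(left, right):
    """Pair every record of `left` with every record of `right`."""
    return [(parse(l)[1], parse(r)[1]) for l, r in product(left, right)]


if __name__ == "__main__":
    a = number_lines(["x", "y"])
    b = number_lines(["1", "2"])
    # Reversing b simulates records arriving out of order.
    print(dot_product(a, list(reversed(b))))
    print(cross_product(a, b))
```

Because the dot product looks records up by their line number, shuffling either input does not change which records are paired. Without the numbers, the pairing would depend on the original file order being preserved.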

h2. Demo

[Example workflow download|]


You can download and analyse the workflow above.
It consists of several beanshells with multiple input and output ports, connected in a non-linear fashion.
Each beanshell performs a different [String|] operation on the data.
In the screencast, I will first introduce the workflow in Taverna, then show it running on Hadoop and finally talk about the template system.

[Link to screencast|]