PPL (short for “Program for parallel Preservation Load”) is a tool for the optimized execution of parallel algorithms for digital preservation actions.
It takes a Taverna workflow as an input and automatically generates all necessary classes to run that workflow as a series of Hadoop jobs on a cluster.
The input for the Hadoop job must be formatted so that each line starts with its line number, followed by a separator (i.e. a tab) and then the actual data.
Otherwise, creating dot products or cross products efficiently would be impossible.
It is strongly recommended to use sequence files as input, as this improves performance significantly (also see here).
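To make the input format more concrete, here is a minimal Python sketch of why the line-number prefix matters. It is not PPL's generated code, only an illustration: the helper names (`number_lines`, `parse`, `dot_product`, `cross_product`) are hypothetical, and starting the numbering at 0 is my assumption, since the actual starting index is not stated here. The idea is that the line number travels with each record, so records from two inputs can still be paired correctly after they have been split up and reordered.

```python
from itertools import product


def number_lines(lines, sep="\t"):
    """Prefix each record with its line number and a separator.

    Numbering from 0 is an assumption made for this sketch.
    """
    return [f"{i}{sep}{line}" for i, line in enumerate(lines)]


def parse(records, sep="\t"):
    """Turn 'number<sep>data' records into a {number: data} mapping."""
    parsed = {}
    for record in records:
        # Split only once, so the data itself may contain the separator.
        key, value = record.split(sep, 1)
        parsed[int(key)] = value
    return parsed


def dot_product(left, right):
    """Pair the records that share the same line number."""
    a, b = parse(left), parse(right)
    return [(a[k], b[k]) for k in sorted(a.keys() & b.keys())]


def cross_product(left, right):
    """Pair every record of one input with every record of the other."""
    a, b = parse(left), parse(right)
    return [(a[i], b[j]) for i, j in product(sorted(a), sorted(b))]


names = number_lines(["alpha", "beta"])
sizes = number_lines(["10", "20"])

# Reversing one input imitates records arriving out of order.
print(dot_product(list(reversed(names)), sizes))
print(cross_product(names, sizes))
```

Because the pairing key is stored in the record itself, the dot product does not depend on the order in which records arrive, which is the property a distributed job needs.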
You can download and analyse the workflow above.
It consists of several beanshells with multiple input and output ports, connected in a non-linear fashion.
Each beanshell performs a different string operation on the data.
In the screencast, I will first introduce the workflow in Taverna, then show it running on Hadoop and finally talk about the template system.