Current version by Martin Schenck on Feb 06, 2013 14:45.

h1. PPL

PPL (for “Program for parallel Preservation Load”) is a tool for the optimized execution of parallel algorithms for digital preservation actions.
It takes a [Taverna|PT.WP.4.MS38 Basic Taverna Workbench available for testbed use] workflow as an input and automatically generates all necessary classes to run that workflow as a series of Hadoop jobs on a cluster.
The input for the Hadoop job must be formatted so that each line starts with its line number, followed by a separator (e.g. a tab) and then the actual data.
Without these line numbers, dot products and cross products of the inputs could not be computed efficiently.
Using [sequence files|http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html] as input is strongly recommended because it improves performance significantly (see also [SequenceFileInputFormat|http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html]).
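
The line-numbered input format described above can be illustrated with a small sketch. It is not part of PPL and does not use Hadoop; the function names ({{number_lines}}, {{parse}}, {{dot_product}}, {{cross_product}}) and the choice of zero-based line numbers are assumptions made for illustration only. It shows why the line number matters: a dot product pairs records that share a line number, which stays possible even when the records arrive in a different order.

```python
def number_lines(lines, separator="\t"):
    """Prefix each record with its line number and a separator.

    Assumption: numbering starts at 0; the source only requires that
    each line starts with its line number."""
    return [f"{i}{separator}{line}" for i, line in enumerate(lines)]


def parse(record, separator="\t"):
    """Split a record into (line_number, data) at the first separator."""
    index, _, data = record.partition(separator)
    return int(index), data


def dot_product(left, right):
    """Pair records that carry the same line number, in the order of `left`."""
    right_by_index = dict(parse(r) for r in right)
    return [
        (data, right_by_index[i])
        for i, data in map(parse, left)
        if i in right_by_index
    ]


def cross_product(left, right):
    """Pair every record of `left` with every record of `right`."""
    return [
        (l_data, r_data)
        for _, l_data in map(parse, left)
        for _, r_data in map(parse, right)
    ]


if __name__ == "__main__":
    names = number_lines(["a.tif", "b.tif"])
    sizes = number_lines(["10", "20"])
    # The second input arrives in reverse order, as it might after a shuffle.
    print(dot_product(names, list(reversed(sizes))))
    print(len(cross_product(names, sizes)))
```

Because each record carries its own index, the pairing does not depend on the order in which the lines are read.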

h2. Demo
!workflow.png!

You can download and analyse the workflow above.
It consists of several beanshells with multiple input and output ports, connected in a non-linear fashion.
Each beanshell performs a different [String|http://docs.oracle.com/javase/7/docs/api/java/lang/String.html] operation on the data.
In the screencast, I will first introduce the workflow in Taverna, then show it running on Hadoop, and finally talk about the template system.

[Link to screencast|http://www.user.tu-berlin.de/schenck/data/T2HCast.mp4]