I've created a workflow for the matchbox tool that uses the Hadoop Streaming API directly: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox/hadoop/pythonwf. I use Python scripts to manage the matchbox commands, and that works well.
Then I wanted to extend this functionality using the Taverna workflow engine. I've created two Taverna workflows for Matchbox:
- The first workflow should make use of pt-mapred functionality.
- The second workflow should make use of the Taverna Hadoop Wrapper (in XML format).
Neither of them works, despite many attempts to get them running.
The first one, based on the pt-mapred JAR, is in my opinion generally not designed for such tasks, so it is possible that this workflow simply cannot work. The idea of the workflow is to find duplicates among images stored on HDFS (/user/training/image_pairs), using Taverna, Hadoop and Matchbox. The necessary Matchbox commands have been tested locally and are included in the workflow. Matchbox, Hadoop and Taverna must be preinstalled on Linux in order to execute this workflow (see attachments).
The second workflow, however, simply starts already-tested Python scripts from Taverna and should actually work. But it does not, and the logging does not explain where the problem could be. This workflow, MatchboxHadoopApi.t2flow, is stored on the myExperiment site (http://www.myexperiment.org/workflows/3892.html) and should enable use of the matchbox tool on Hadoop with Taverna. It is based on the Python scripts and Hadoop Streaming API included in the "pythonwf" folder of the pc-qa-matchbox project on GitHub (https://github.com/openplanets/scape/tree/master/pc-qa-matchbox/hadoop/pythonwf).
For this workflow we assume that the digital collection is located on HDFS and that we have a list of input files in the format "hdfs:///user/training/collection/00000032.jp2" - one row per file entry.
This list can also be generated by the scripts. By changing the Python scripts, a user can customize the workflow and adjust it to institutional needs.
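To illustrate, generating such an input list takes only a few lines of Python; the helper below is hypothetical (not one of the pythonwf scripts), and the HDFS path and folder name are the example values used in this post:

```python
def build_input_list(filenames, hdfspath="/user/training", collection="collection"):
    """Return one 'hdfs:///...' row per collection file, as the workflow expects."""
    base = hdfspath.rstrip("/")
    return ["hdfs://%s/%s/%s" % (base, collection, name)
            for name in sorted(filenames)]

if __name__ == "__main__":
    for row in build_input_list(["00000032.jp2", "00000031.jp2"]):
        print(row)
```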
This workflow does not use the pt-mapred JAR; it uses the Hadoop Streaming API directly in order to avoid additional dependencies. The workflow has four input parameters but can also be used with default values.
These parameters are:
- homepath is the path to the scripts on the local machine, e.g. "/home/training/pythonwf"
- hdfspath is the path to the home directory on HDFS, e.g. "/user/training"
- collectionpath is the name of the folder that contains the digital collection on HDFS, e.g. "collection"
- summarypath is the name of the folder that contains the calculation results (the list of possible duplicates) on HDFS, e.g. "compare".
The list of possible duplicates can be found in the file benchmark_result_list.csv in the summary path.
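The scripts derive their HDFS working folders from these four parameters. As a rough sketch of that layout (the helper itself is hypothetical; the folder names are the ones used by the workflow steps described in this post):

```python
def hdfs_layout(hdfspath="/user/training", collectionpath="collection",
                summarypath="compare"):
    """Map workflow stages to the HDFS folders used in this post."""
    base = hdfspath.rstrip("/")
    return {
        "collection": "%s/%s" % (base, collectionpath),  # input images
        "inputfiles": base + "/inputfiles",              # input file list
        "matchbox":   base + "/matchbox",                # SIFT feature files
        "bow":        base + "/bow",                     # visual dictionary
        "histogram":  base + "/histogram",               # visual histograms
        "summary":    "%s/%s" % (base, summarypath),     # benchmark_result_list.csv
    }
```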
The main script in the workflow is PythonMatchboxWF.sh (it works fine without Taverna), which comprises all the other scripts. An experienced user could execute each workflow step as a separate module in order to better manage the script parameters.
- The first step in the workflow is the preparation of the input file list, performed by CreateInputFiles.sh. The result of this step is a file with the paths to the collection files, stored in the inputfiles folder on HDFS.
- The second step is SIFT feature extraction, calculated using the "binary" parameter in order to improve performance. The results of this step are feature files for each input file, such as "00000031.jp2.SIFTComparison.descriptors.dat", "00000031.jp2.SIFTComparison.keypoints.dat" and "00000031.jp2.SIFTComparison.feat.xml.gz", stored in the matchbox folder on HDFS.
- The third step is the calculation of the Bag of Words (visual dictionary), performed by CmdCalculateBoW.sh. The result is stored in the file bow.xml in the bow folder on HDFS.
- Then we extract visual histograms for each input file using CmdExtractHistogram.sh. The result is stored in the histogram folder on HDFS, e.g. "00000031.jp2.BOWHistogram.feat.xml.gz".
- The final step is to perform the actual comparison using CmdCompare.sh. The results are stored in the compare folder on HDFS and comprise the file benchmark_result_list.csv, which lists possible duplicates, one pair per row, e.g. img1;img2;similarity between 0 (low) and 1 (high).
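Reading the result list back is straightforward. A small sketch, assuming the semicolon-separated img1;img2;similarity rows described above (the helper and the 0.9 threshold are illustrative, not part of the workflow):

```python
def read_duplicates(lines, threshold=0.9):
    """Yield (img1, img2, similarity) rows whose similarity meets the threshold.

    `lines` are rows of benchmark_result_list.csv in the form
    'img1;img2;similarity', with similarity between 0 (low) and 1 (high).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        img1, img2, sim = line.split(";")
        if float(sim) >= threshold:
            yield img1, img2, float(sim)
```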
A Hadoop mapper doing just xcorrSound waveform-compare is fairly simple. Taking as input a file containing a list of pairs of paths to wav files (migrated and comparison) to compare, we can run the command line tool on each pair and output the exit code and the waveform-compare text output as mapper output. We can then add a reducer counting the number of good and the number of bad migrations.
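In Python (the language of Hadoop Streaming scripts), the mapper and reducer could be sketched roughly as follows; the "waveform-compare" command name and the tab-separated record layout are assumptions, and real mapper output handling is simplified to just the exit code:

```python
import subprocess

def map_pair(line, run=subprocess.run):
    """Mapper step for one input line 'migrated.wav<TAB>comparison.wav'.

    Runs the comparison tool on the pair and returns an
    'exit_code<TAB>migrated_path' record for the reducer. The `run`
    parameter exists only so the subprocess call can be swapped out."""
    migrated, comparison = line.rstrip("\n").split("\t")
    result = run(["waveform-compare", migrated, comparison])
    return "%d\t%s" % (result.returncode, migrated)

def reduce_counts(records):
    """Reducer step: count good (exit code 0) and bad migrations."""
    good = bad = 0
    for record in records:
        code, _path = record.split("\t", 1)
        if code == "0":
            good += 1
        else:
            bad += 1
    return {"good": good, "bad": bad}
```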
A Taverna workflow doing xcorrSound waveform-compare is also simple, as is a Taverna workflow calling the xcorrSound waveform-compare Hadoop job described above!
Thus we are writing an audio Migration+QA Taverna workflow using a number of Hadoop jobs. Version 1 (simple audio QA) is a Taverna workflow including three Hadoop jobs: ffmpeg migration, mpg321 conversion, and waveform-compare on file lists/directories. Version 2 would add ffprobe property extraction and comparison. I have changed my mind about which input / output fits the tools / Taverna / Hadoop best. For now, the input to the Taverna workflow is a file containing a list of paths to the mp3 files to migrate, plus an output path (plus the number of files per task). This is also the input to the ffmpeg migration Hadoop job and the mpg321 conversion job. The output from these will be the paths to the wav files (the output directories will also contain logs). These lists of paths to ffmpeg-migrated wavs and mpg321-converted wavs will then be combined in Taverna into a list of pairs of paths to wav files, which will be used as input to the xcorrSound waveform-compare Hadoop job.
Work in progress. https://github.com/statsbiblioteket/scape-audio-qa