Bolette Jurik (SB)
The workflow is the same as in the SB Experiment SO4 Audio mp3 to wav Migration and QA Workflow:
- Migration from mp3 to wav using FFmpeg
- Validation that the migrated file is a valid file in the target format, using JHOVE2
- Extraction and comparison of header properties of the original and the migrated files, using FFprobe
- Conversion of the mp3 file to wav using mpg321
- Comparison of the two wav files using xcorrSound waveform-compare (previously called migrationQA)
The difference is that the workflow is implemented as a number of Hadoop jobs (Hadoop mappers) instead of as a Taverna workflow.
The project is available from https://github.com/statsbiblioteket/scape-audio-qa.
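To give an idea of the shape of these jobs, here is a minimal sketch of what a migration mapper might look like (hypothetical class name and output naming, not the actual code from that repository): each input line is the NFS path of one mp3 file, and the mapper shells out to FFmpeg.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical migration mapper: each input line is the NFS path of one
// mp3 file; the mapper shells out to FFmpeg and emits the path of the
// migrated wav file (or a failure marker).
public class FfmpegMigrationMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String mp3Path = value.toString().trim();
        String wavPath = mp3Path.replaceAll("\\.mp3$", ".wav");

        // ffmpeg -y -i <input.mp3> <output.wav>; both paths are on NFS.
        Process p = new ProcessBuilder("ffmpeg", "-y", "-i", mp3Path, wavPath)
                .inheritIO()
                .start();
        int exitCode = p.waitFor();

        context.write(new Text(mp3Path),
                new Text(exitCode == 0 ? wavPath : "FAILED exit=" + exitCode));
    }
}
```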
In addition, there is now a Taverna workflow combining three of these Hadoop jobs.
To sum up, this workflow performs migration, conversion and content comparison. The top left box (a nested workflow) migrates a list of mp3s to wav files with a Hadoop map-reduce job wrapping the command-line tool FFmpeg, and outputs a list of migrated wav files. The top right box converts the same list of mp3s to wav files with another Hadoop map-reduce job wrapping the command-line tool mpg321, and outputs a list of converted wav files. The Taverna workflow then zips the two lists of wav files together, so the bottom box receives a list of pairs of wav files to compare. The bottom box compares the content of the paired files with a Hadoop map-reduce job wrapping the xcorrSound waveform-compare command-line tool, and outputs the results of the comparisons.
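As a rough illustration of the bottom box, a comparison mapper along these lines could consume one pair of wav paths per input line and run waveform-compare on the pair. This is a sketch with assumed input formatting and an assumed tool invocation, not the actual code from the repository.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical comparison mapper: each input line holds a pair
// "<migrated.wav> <converted.wav>" (both on NFS); the mapper runs
// xcorrSound waveform-compare on the pair and emits the tool's verdict.
public class WaveformCompareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] pair = value.toString().trim().split("\\s+");
        if (pair.length != 2) {
            context.write(value, new Text("SKIPPED: not a pair"));
            return;
        }

        // Assumed invocation: waveform-compare <file1> <file2>
        Process p = new ProcessBuilder("waveform-compare", pair[0], pair[1])
                .redirectErrorStream(true)
                .start();

        // Collect the tool's output (success/failure plus offset details).
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append(' ');
            }
        }
        int exitCode = p.waitFor();

        context.write(new Text(pair[0] + " vs " + pair[1]),
                new Text("exit=" + exitCode + " " + output.toString().trim()));
    }
}
```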
The file containing the list of mp3 files to be migrated is available from HDFS. The mp3 files themselves are stored on NFS, and the resulting wav files are written back to NFS. There are a number of reasons for this.
- The first is that the audio tools we are using were written to read from and write to NFS.
- Also, at SB digitally preserved material does not reside on HDFS, which means that in order to migrate from and to HDFS, we would first need to copy the mp3s to HDFS and later copy the wavs off HDFS again. These extra copy operations are expensive when we are talking about large-scale audio collections.
- Finally, the SB Hadoop Platform is set up using network storage as local storage, which means that we do not exploit the HDFS data-locality property anyway; accessing the files on NFS rather than HDFS therefore does not present a large overhead.
The preservation event and log files are all written to HDFS. This means we have a rather complex input/output model, with input from both HDFS and NFS and output to both HDFS and NFS. And this is of course only an experiment! If this workflow is going to be used in production, we need to add the repository connection, so that data can be both retrieved from and written to the repository.
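The mixed input/output model could be wired up roughly as in the following hypothetical job driver, where the file list and the job output live on HDFS while the mappers read and write the audio files on NFS (the paths and class names are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver illustrating the mixed I/O model: the list of mp3
// paths and the job output (events, logs) live on HDFS, while the audio
// files themselves are read from and written to NFS inside the mappers.
public class MigrationJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mp3-to-wav-migration");
        job.setJarByClass(MigrationJobDriver.class);
        job.setMapperClass(FfmpegMigrationMapper.class);
        job.setNumReduceTasks(0);                 // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input: a text file on HDFS listing one NFS mp3 path per line.
        FileInputFormat.addInputPath(job,
                new Path("hdfs:///scape/input/mp3-list.txt"));
        // Output: preservation events and logs, also written to HDFS.
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs:///scape/output/migration-events"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```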
What we would like to do next is:
- Run an experiment using 1TB of mp3 files on the SB Hadoop cluster. This, however, requires some updates to the workflow. For 1TB of input mp3 files, the workflow currently generates approximately 25TB of output and temporary wav files. Our test set-up is not suited for this, so we would like to delete these files along the way. We would therefore like the Taverna workflow to work on lists of lists of files: we can then limit the size of the data written to e.g. 2TB and delete the wav files before continuing, as the only important outputs of the experiment are the comparison results and the performance measurements. This would also let us experiment with sending the Hadoop jobs lists of different sizes (see the first sketch after this list).
- Extend the workflow with property comparison. The waveform-compare tool only compares sound waves; it does not look at the header information, which should also be part of the quality assurance of a migration. The reason this is not top priority is that FFprobe property extraction and comparison is very fast, and will probably not affect overall workflow performance much (the second sketch after this list illustrates what it could look like).
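The batching mentioned in the first point could look roughly like this sketch, which splits the input list into sublists whose expected wav output stays under a configurable byte budget. The 25x blow-up factor is taken from the observed ~25TB of output and temporary wav files for 1TB of input; everything else is hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch planner: split the full list of mp3 paths into
// sublists whose expected wav output stays under a byte budget (e.g. 2TB),
// so each batch can be processed and its temporary wav files deleted
// before the next batch starts.
public class BatchPlanner {

    // Rough assumption: the wav output (migrated plus converted copies) is
    // about 25x the mp3 input size, matching the ~25TB observed for 1TB.
    private static final long WAV_BLOWUP_FACTOR = 25;

    public static List<List<String>> planBatches(List<String> mp3Paths,
                                                 long batchBudgetBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        for (String path : mp3Paths) {
            long expectedOutput = new File(path).length() * WAV_BLOWUP_FACTOR;
            if (!current.isEmpty()
                    && currentBytes + expectedOutput > batchBudgetBytes) {
                batches.add(current);       // close the batch at the budget
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(path);
            currentBytes += expectedOutput;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Calling `planBatches(allMp3s, 2_000_000_000_000L)` would then yield batches of roughly 2TB of expected output each, and varying the budget gives the differently sized lists we want to experiment with.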
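The property comparison mentioned in the second point might look like the following sketch, which runs FFprobe on both files and compares a few stream properties. The choice of properties here is illustrative, not a specification of the planned comparison.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical property comparison: run FFprobe on the original and the
// migrated file, parse the key=value stream sections, and check that
// selected header properties agree.
public class FfprobePropertyCompare {

    static Map<String, String> probe(String path)
            throws IOException, InterruptedException {
        // "ffprobe -show_streams <file>" prints stream properties as
        // key=value lines between [STREAM] ... [/STREAM] markers.
        Process p = new ProcessBuilder("ffprobe", "-show_streams", path)
                .redirectErrorStream(true)
                .start();

        Map<String, String> properties = new HashMap<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    properties.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        }
        p.waitFor();
        return properties;
    }

    public static boolean sameAudioProperties(String original, String migrated)
            throws IOException, InterruptedException {
        Map<String, String> a = probe(original);
        Map<String, String> b = probe(migrated);
        // Compare a small set of header properties; codec_name is expected
        // to differ (mp3 vs pcm), and duration would need a tolerance
        // because of encoder padding, so neither is compared here.
        for (String property : new String[] {"sample_rate", "channels"}) {
            if (a.get(property) == null
                    || !a.get(property).equals(b.get(property))) {
                return false;
            }
        }
        return true;
    }
}
```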