Parallel processing of identification and characterisation jobs

Version 1 by Maurice de Rooij
on Jan 30, 2013 14:15.

compared with
Current by Maurice de Rooij
on Jan 30, 2013 14:18.

Here are some ideas and suggestions regarding parallel processing, focusing on running identification and characterisation jobs in parallel. Please contribute and comment! [~techmaurice] / 30-01-2012

---

IMHO we should also give multi/parallel processing more attention. Most (non-Java) tools are not designed to benefit from modern multi-core architectures. Luckily, this is easy to work around with simple wrapper scripts. However, there are potential pitfalls, such as race conditions, deadlocks and memory leaks.

A good example is the OPF format identification tool [TR:Fido], which is a single-process application. Although it is able to process around 200 files/second on most standard Linux distros, it should be possible to process that number of files times the number of cores you have available. On a blade system with 12 cores, FIDO should then be able to process around 2000 files per second (12 x 200 = 2400 in the ideal case, less some overhead).

In order to prove and test this I will bring a prototype Python multi-process wrapper with me which could then be re-used for other purposes.
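To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the prototype mentioned above, only an illustration using the standard `multiprocessing` and `subprocess` modules. The function names (`chunked`, `run_tool`, `run_parallel`) are made up for this example, and it assumes the wrapped tool accepts a list of file paths as trailing command-line arguments, which may not hold for every tool.

```python
import subprocess
import sys
from multiprocessing import Pool, cpu_count


def chunked(items, n_chunks):
    """Split items into at most n_chunks non-empty, roughly equal lists."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    # Round-robin slicing keeps the batch sizes within one item of each other.
    return [items[i::n_chunks] for i in range(n_chunks)]


def run_tool(job):
    """Run one instance of the tool on one batch and return its stdout."""
    command, batch = job
    result = subprocess.run(command + batch, capture_output=True, text=True)
    return result.stdout


def run_parallel(command, files, workers=None):
    """Run `command` over `files`, one tool process per batch of files."""
    batches = chunked(files, workers or cpu_count())
    if not batches:
        return []
    # Each worker process starts its own copy of the tool, so the tool
    # instances share no state with each other.
    with Pool(processes=len(batches)) as pool:
        return pool.map(run_tool, [(command, batch) for batch in batches])


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python wrapper.py <tool> <file> [<file> ...]
    for output in run_parallel([sys.argv[1]], sys.argv[2:]):
        sys.stdout.write(output)
```

Each batch gets its own tool process, so per-process start-up cost is paid once per core rather than once per file. Whether the throughput really scales with the number of cores will depend on disk I/O and on the tool itself, so this needs measuring.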

Also, please note that I explicitly do not mention multithreading, which is a different kind of beast. Multithreading suffers from the same pitfalls as multiprocessing, possibly even more so, because in most cases multiple threads share resources.

[~techmaurice] / 22-01-2013

Maurice, dunno if you're interested, but this [http://www.boddie.org.uk/python/pprocess.html] might help. I also wonder if we can use something like ppss [http://code.google.com/p/ppss/] ?

: Thanks, I hadn't seen these yet. I've been experimenting with the standard Python multiprocessing package, which is quite alright. [~techmaurice] / 28-01-2013

You may also like GNU Parallel ([http://www.gnu.org/software/parallel/]), which makes simple multiprocessing somewhat more accessible. [~anjackson]/2013-01-28

The Python Wiki is back after "an attack" - it has a very good list: [http://wiki.python.org/moin/ParallelProcessing] Peter Cliff/2013-01-28