
Here are some ideas and suggestions regarding parallel processing, focusing on running identification and characterisation jobs in parallel. Please contribute and comment! [~techmaurice] / 30-01-2012


IMHO we should also give multi/parallel processing more attention. Most (non-Java) tools are not designed to benefit from modern multi-core architectures. Luckily, this is usually easy to work around with simple wrapper scripts that run several instances of a tool side by side. However, there are potential pitfalls, such as race conditions, deadlocks and memory leaks.

A good example is the OPF format identification tool [TR:Fido], which is a single-process application. Although it is able to process around 200 files/second on most standard Linux distros, it should be possible to process roughly that number of files times the number of processor cores you have available, provided disk I/O keeps up. On a Blade system with 12 cores that is 12 × 200 = 2400 files per second in the ideal case, so FIDO should be able to process at least around 2000 files per second.

In order to prove and test this I will bring a prototype Python multi-process wrapper with me which could then be re-used for other purposes.
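To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the actual prototype. It assumes the wrapped tool accepts a list of file paths as command-line arguments and writes its results to stdout; the command name `fido` in the usage part and the helper names `chunked`, `run_batch` and `run_parallel` are made up for illustration.

```python
import subprocess
import sys
from multiprocessing import Pool


def chunked(items, n_chunks):
    """Split items into at most n_chunks non-empty, roughly equal lists."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    return [items[i::n_chunks] for i in range(n_chunks)]


def run_batch(args):
    """Run one instance of the tool on one batch of files, return its stdout."""
    command, batch = args
    result = subprocess.run(
        command + batch, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_parallel(command, files, workers=4):
    """Run `command` once per batch, using up to `workers` worker processes."""
    batches = chunked(files, workers)
    if not batches:
        return []
    with Pool(processes=workers) as pool:
        return pool.map(run_batch, [(command, batch) for batch in batches])


if __name__ == "__main__":
    # Assumption: a "fido" executable is on the PATH and takes file names
    # as positional arguments. Replace with the real command line.
    for output in run_parallel(["fido"], sys.argv[1:], workers=4):
        sys.stdout.write(output)
```

Each worker starts its own copy of the tool, so the tool processes share no state and the usual shared-resource pitfalls are limited to whatever the tool itself writes to disk. Splitting the file list into one batch per worker keeps process start-up cost low, at the price of uneven load if some batches contain much larger files.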

Also, please note that I explicitly do not mention multithreading, which is a different beast. Multithreading suffers from the same pitfalls as multiprocessing, possibly even more so, because in most cases multiple threads share resources.

[~techmaurice] / 22-01-2013

Maurice, dunno if you're interested, but this [] might help. I also wonder if we can use something like ppss [] ?

: Thanks, I hadn't seen these yet. I've been experimenting with the standard Python multiprocessing package, which is quite alright. [~techmaurice] / 28-01-2013

You may also like GNU Parallel ([]), which makes simple multiprocessing somewhat more accessible. [~anjackson]/2013-01-28

The Python Wiki is back after "an attack" - it has a very good list: [] Peter Cliff/2013-01-28