17. Parallelism

Parallelism is built into the current version of Febrl and is transparent to the user. Running Febrl in parallel either shortens the run time compared with a sequential run, or allows larger problems to be solved, because parallel computing platforms usually offer larger amounts of memory.

To use the parallel functionality of Febrl, the following software must be installed on your computing platform (assuming that parallel hardware, such as a multiprocessor or a cluster of personal computers or workstations, is available).

MPI (Message Passing Interface)
MPI is a de facto standard for parallel programming on distributed memory platforms. It defines a large number of routines for communicating data (i.e. messages) between processors. MPI itself only defines the standard, so that programs written with MPI are portable across parallel platforms. Different implementations are available, some from vendors of parallel (super-) computers, others as freely downloadable packages. Please see the MPI web page at

http://www-unix.mcs.anl.gov/mpi/

for more information on MPI and links to various implementations. Note that on many platforms administrator (or superuser) access rights are needed to install an MPI implementation.

Pypar
Pypar is an efficient and easy-to-use module that allows programs/scripts written in the Python programming language to run in parallel on multiple processors and communicate using MPI. See the Pypar web page at

http://datamining.anu.edu.au/pypar

for more information and to download the package.

Once both MPI and Pypar are installed and tested successfully, you can run Febrl in parallel with the mpirun command of your MPI implementation. For example, if you have a Febrl project module called myproject.py and a parallel platform with 8 processors, you can run Febrl in parallel by using

  mpirun -np 8 python myproject.py
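The command above starts 8 copies of the same Python script, and each copy has to decide which part of the work belongs to it. The following is only an illustrative sketch of one common scheme (block partitioning of records). It is an assumption for illustration, not a description of how Febrl actually distributes its work. The function name block_range is hypothetical. Under Pypar, the values of rank and size would typically be obtained from pypar.rank() and pypar.size(); here they are passed in explicitly so that the sketch runs without MPI.

```python
def block_range(num_records, rank, size):
    """Return (start, end), the half-open slice of records for one process.

    Hypothetical helper: illustrates block partitioning only and is not
    taken from the Febrl sources.
    """
    base, extra = divmod(num_records, size)
    # The first `extra` processes take one additional record each.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


if __name__ == "__main__":
    # Simulate the 8 processes started by the mpirun example above.
    for rank in range(8):
        print(rank, block_range(1000, rank, 8))
```

With this scheme every record is assigned to exactly one process, and the block sizes differ by at most one record.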

Note: To access the data sets and look-up tables, all processors must have access to the directory (and sub-directories) defined in the Febrl object attribute febrl_path. Future versions of Febrl will allow a much more sophisticated definition of parallelisation settings.
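One way to verify this requirement before starting a long job is to let every process report whether it can read the directory. The script below is a minimal sketch and not part of Febrl. The function name path_is_readable is hypothetical, and the directory is taken from the command line rather than from a Febrl object. It makes no MPI calls, so it assumes that your MPI implementation is willing to launch a plain Python script.

```python
import os
import socket
import sys


def path_is_readable(path):
    """Return True if `path` is an existing directory that this process
    can list and traverse. Hypothetical helper, not part of Febrl."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    # Directory to check; defaults to the current working directory.
    febrl_path = sys.argv[1] if len(sys.argv) > 1 else "."
    status = "ok" if path_is_readable(febrl_path) else "NOT accessible"
    # Host name and process id identify which process printed the line.
    print("%s (pid %d): %s is %s"
          % (socket.gethostname(), os.getpid(), febrl_path, status))
```

Saved under a name of your choice and started through mpirun with the same number of processes as the real job, it prints one line per process. Any line reporting "NOT accessible" points to a node that cannot see the directory.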

Note: Parallelism within Febrl is in its initial stages, and we would like to ask people who are interested in this area to contact the authors to exchange detailed information and experiences.

Warning: During extensive tests running parallel Febrl jobs, we got slightly different numerical results in some cases when linking or deduplicating larger data sets (with around

or more records). When comparing the results from running the same job on different numbers of processes, the final weights (as stored in a results file) sometimes differ by around $10^{-4}$ (the fourth or fifth digit after the decimal point).

So far we have not found the cause of these differences, which might lie in our local MPI/Pypar installation or in one of the Febrl modules.

We are currently working on this problem and will publish an updated and hopefully corrected version of Febrl as soon as the problem is solved.
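To see whether your own runs show the differences described in this warning, you can compare the final weights from two runs of the same job. The sketch below assumes the weights have already been read from the two results files into Python lists in the same record-pair order. Parsing the results file is left out because its layout is not described here, and the function name max_abs_difference is hypothetical.

```python
def max_abs_difference(weights_a, weights_b):
    """Return the largest absolute difference between two weight lists.

    Hypothetical helper: both lists are assumed to hold the final weights
    of the same record pairs in the same order.
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("the two runs produced different numbers of weights")
    return max((abs(a - b) for a, b in zip(weights_a, weights_b)),
               default=0.0)


if __name__ == "__main__":
    # Made-up example values, not results from an actual Febrl run.
    run_2_procs = [12.5, -3.25, 7.0]
    run_8_procs = [12.5, -3.2501, 7.0]
    print(max_abs_difference(run_2_procs, run_8_procs))
```

A value of around 0.0001 would match the magnitude reported above, while a much larger value suggests a different problem.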