Front Matter

See Appendix H of this document for the license conditions under which this document and the computer programs described in it may be used.

This manual describes prototype software called Febrl designed to undertake probabilistic data cleaning and standardisation, deduplication and record linkage. Written in the Python programming language, this software aims to allow health, biomedical and other researchers to clean (standardise) and deduplicate or link data sets of all sizes faster, with less effort and with improved quality.

This fourth release, Febrl Version 0.2.2, is a bug-fix release of Version 0.2.1. The main features of the current release are:

Probabilistic and rules-based cleaning and standardisation routines for names, addresses and dates.
A variety of supplied look-up and frequency tables for names and addresses.
Various comparison functions for names, addresses, dates and localities, including approximate string comparisons, phonetic encodings, geographical distance comparisons, and time and age comparisons.
Several blocking (indexing) methods, including the traditional compound key blocking used in many record linkage programs.
Probabilistic record linkage routines based on the classical Fellegi and Sunter approach, as well as a flexible classifier that allows the weight calculation to be defined by the user.
Process indicators that give estimations of remaining processing times.
Access methods for fixed format and comma-separated value (CSV) text files, as well as SQL databases.
An efficient temporary data set with direct random access, based on the Berkeley database library.
One-to-one assignment procedure for linked record pairs based on the Auction algorithm.
Supports parallelism for higher performance on parallel platforms, based on MPI (Message Passing Interface), a standard for parallel programming, and Pypar, an efficient and easy-to-use module that allows Python programs to run in parallel on multiple processors and communicate using MPI.
A database generator that creates data sets of randomly generated records (containing names, addresses and dates), optionally including duplicate records with randomly introduced modifications. This allows for easy testing and evaluation of linkage (deduplication) processes.
Example project modules and example data sets allowing simple running of Febrl projects without any modifications needed.
This extensive manual.
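
To make two of the features listed above more concrete, the following minimal sketch shows how compound key blocking and a Fellegi and Sunter style weight calculation fit together. It is an illustration only and does not use the Febrl API: the function names, the field names, the choice of blocking key and the m- and u-probabilities are all assumptions made for this example.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical m- and u-probabilities per field (illustrative values only,
# not taken from Febrl): m = P(agree | match), u = P(agree | non-match).
M_U = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
}

def blocking_key(record):
    """Assumed compound key: postcode plus the first letter of the surname."""
    return record["postcode"] + record["surname"][:1].lower()

def candidate_pairs(records):
    """Yield index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)

def match_weight(rec_a, rec_b):
    """Sum the log2 agreement or disagreement weight of every compared field."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)                   # agreement weight
        else:
            total += math.log2((1.0 - m) / (1.0 - u))   # disagreement weight
    return total

records = [
    {"given_name": "peter", "surname": "miller", "postcode": "2600", "birth_year": "1970"},
    {"given_name": "pete", "surname": "miller", "postcode": "2600", "birth_year": "1970"},
    {"given_name": "mary", "surname": "smith", "postcode": "2000", "birth_year": "1982"},
]

# Only records within the same block are compared.
pairs = list(candidate_pairs(records))
weights = {(a, b): match_weight(records[a], records[b]) for a, b in pairs}
```

In this sketch only the first two records share a blocking key, so a single pair is compared instead of all three possible pairs. This sketch uses exact comparison for simplicity; approximate string comparisons, as listed among the features above, would replace the equality test.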

The authors would be grateful if users of Febrl would inform us (by e-mail) of how they have used the system. We are particularly interested in references to scientific papers or reports which mention or cite Febrl.

See Also:

Febrl Project Web Site
for information about this project.
Python Web Site
for information on the Python programming language.
MPI (Message Passing Interface) Web Site
for information on MPI.
Pypar Web Site
for information on Pypar.
MySQL Web Site
for information on the open source database MySQL.
MySQL for Python
for information on the Python MySQL module.
Sleepycat Software
for information on the Berkeley database library.
Python Bindings for BerkeleyDB
for information on the Python bsddb3 module.
Logilab HMM Python module
for information on Logilab's Python hidden Markov model hmm module.

This document is subject to the ANUOS License Version 1.1 (the License, see Appendix H of this document); you may not use this document except in compliance with the License. All Febrl computer program code and associated data files and documentation, including this document, are distributed under the License on an AS IS basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.

Abstract: