This manual describes prototype software called Febrl
designed to undertake probabilistic data cleaning and standardisation,
deduplication and record linkage. Written in the Python
programming language, this software aims to allow health, biomedical
and other researchers to clean (standardise) and deduplicate or link
data sets of all sizes faster, with less effort and with improved
quality.
This fourth release, Febrl Version 0.2.2, is a bug-fix release
of Version 0.2.1. The main features of the current release are:
- Probabilistic and rules-based cleaning and standardisation
routines for names, addresses and dates.
- A variety of supplied look-up and frequency tables for names and
addresses.
- Various comparison functions for names, addresses, dates and
localities, including approximate string comparisons, phonetic
encodings, geographical distance comparisons, and time and age
comparisons.
- Several blocking (indexing) methods, including the traditional
compound key blocking used in many record linkage programs.
- Probabilistic record linkage routines based on the classical
Fellegi and Sunter approach, as well as a flexible
classifier whose weight calculation can be customised.
- Process indicators that estimate the remaining processing
time.
- Access methods for fixed format and comma-separated value (CSV)
text files, as well as SQL databases.
- An efficient temporary data set with direct random access, based
on the Berkeley database library.
- A one-to-one assignment procedure for linked record pairs, based on
the Auction algorithm.
- Support for parallel execution, giving higher performance on parallel
platforms, based on MPI (Message Passing Interface), a
standard for parallel programming, and Pypar, an
efficient and easy-to-use module that allows Python programs to
run in parallel on multiple processors and communicate using
MPI.
- A database generator that creates data sets of randomly
generated records (containing names, addresses and dates),
optionally including duplicate records with randomly
introduced modifications. This allows for easy testing and
evaluation of linkage (deduplication) processes.
- Example project modules and example data sets that allow
Febrl projects to be run without any modification.
- This extensive manual.
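As a rough illustration of two of the features listed above (compound
key blocking and weight calculation following the Fellegi and Sunter
approach), the following minimal sketch may help. It does not use the
Febrl modules and is written for a current Python 3 interpreter. All
function names, field names, probabilities and thresholds in it are
assumptions chosen for illustration only.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical m- and u-probabilities per field (illustrative values only):
# m = P(field agrees | records match), u = P(field agrees | no match)
FIELD_PROBS = {
    "given_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "postcode": (0.90, 0.05),
}


def blocking_key(record):
    """Compound key: postcode plus the first letter of the surname."""
    return record["postcode"] + record["surname"][:1].lower()


def candidate_pairs(records):
    """Yield only those record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def match_weight(rec_a, rec_b):
    """Sum the log2 agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Map a total weight to link, possible link or non-link."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Made-up example records, not taken from the Febrl example data sets.
records = [
    {"id": 1, "given_name": "peter", "surname": "miller", "postcode": "2600"},
    {"id": 2, "given_name": "peter", "surname": "miller", "postcode": "2600"},
    {"id": 3, "given_name": "pete", "surname": "miller", "postcode": "2600"},
    {"id": 4, "given_name": "mary", "surname": "smith", "postcode": "2601"},
]

for rec_a, rec_b in candidate_pairs(records):
    weight = match_weight(rec_a, rec_b)
    print(rec_a["id"], rec_b["id"], round(weight, 2), classify(weight))
```

In this sketch blocking reduces the six possible record pairs to the
three that share a key, and only those pairs are weighted. Exact
string agreement stands in for the approximate comparison functions
mentioned above, purely to keep the example short.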
The authors would be grateful if users of Febrl would inform
us (by e-mail) of how they have used the system. We are particularly
interested in references to scientific papers or reports which mention
or cite Febrl.
This document is subject to the ANUOS License Version 1.1 (the
License, see Appendix H of this document);
you may not use this document except in compliance with the License.
All Febrl computer program code and associated data files
and documentation, including this document, are distributed under the
License on an AS IS basis, WITHOUT WARRANTY OF ANY KIND,
either express or implied. See the License for the specific language
governing rights and limitations under the License.