This manual describes prototype software called Febrl, designed to
undertake probabilistic data cleaning and standardisation,
deduplication and record linkage. Written in the Python programming
language, this software aims to allow health, biomedical and other
researchers to clean (standardise) and deduplicate or link data sets
of all sizes faster, with less effort and with improved quality.
This fifth release, Febrl Version 0.3.1, adds geocoding as its main
new feature, along with several smaller updated or improved features.
The main features of the current release are:
- Probabilistic and rules-based cleaning and standardisation
routines for names, addresses, dates and telephone numbers
(new).
- A geocoding matching system based on the Australian
G-NAF (Geocoded National Address File) database.
- A variety of supplied look-up and frequency tables for names and
addresses.
- Various comparison functions for names, addresses, dates and
localities, including approximate string comparisons, phonetic
encodings, geographical distance comparisons, and time and age
comparisons. Two new approximate string comparison methods
(bag distance and compression based) have been added in this
release.
- Several blocking (indexing) methods, including the traditional
compound key blocking used in many record linkage programs.
- Probabilistic record linkage routines based on the classical
Fellegi and Sunter approach, as well as a flexible
classifier that allows a user-defined weight calculation.
- (New) Possibility to save the raw field comparison matching
weights into a text file (including record numbers or identifiers
if available).
- Process indicators that give estimations of remaining processing
times.
- Access methods for fixed format and comma-separated value (CSV)
text files, as well as SQL databases (MySQL and, new in this
release, PostgreSQL).
- Efficient temporary direct random access data set based on the
Berkeley database library.
- Possibility to save linkage and deduplication results into a
comma-separated value (CSV) text file (new).
- One-to-one assignment procedure for linked record pairs based on
the Auction algorithm.
- Supports parallelism for higher performance on parallel
platforms, based on MPI (Message Passing Interface), a
standard for parallel programming, and Pypar, an
efficient and easy-to-use module that allows Python programs to
run in parallel on multiple processors and communicate using
MPI.
- A data set generator which allows the creation of data sets of
randomly created records (containing names, addresses, dates,
and phone and identifier numbers) with the possibility to
include duplicate records with randomly introduced
modifications. This allows for easy testing and evaluation of
linkage (deduplication) processes.
- Example project modules and example data sets allowing simple
running of Febrl projects without any modifications
needed.
- A new auxiliary program fileanalysis.py which
analyses a data file and collects statistical information that
can be useful for choosing blocking criteria.
- This extensive manual.
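As an illustration of one of the approximate string comparison methods
listed above, the bag distance can be sketched in a few lines of Python.
This is a minimal sketch and not the Febrl implementation: the function
names bag_distance and bag_similarity are hypothetical, and normalising
by the length of the longer string is an assumption made here for
illustration. The bag distance treats each string as a multiset (bag) of
characters and ignores character order, which makes it cheap to compute
and a lower bound on the edit distance.

```python
from collections import Counter


def bag_distance(s1: str, s2: str) -> int:
    """Return the bag distance between two strings (hypothetical helper).

    Each string is treated as a multiset of characters. The distance is
    the larger of the two multiset differences, so character order is
    ignored.
    """
    bag1 = Counter(s1)
    bag2 = Counter(s2)
    # Counter subtraction keeps only positive counts, i.e. the
    # characters that are left over in one bag but not in the other.
    left_over_1 = sum((bag1 - bag2).values())
    left_over_2 = sum((bag2 - bag1).values())
    return max(left_over_1, left_over_2)


def bag_similarity(s1: str, s2: str) -> float:
    """Scale the bag distance into a similarity between 0.0 and 1.0.

    Assumption for this sketch: normalise by the longer string length.
    """
    if not s1 and not s2:
        return 1.0  # Two empty strings are considered identical
    return 1.0 - bag_distance(s1, s2) / max(len(s1), len(s2))


if __name__ == "__main__":
    print(bag_distance("peter", "pedro"))
    print(bag_similarity("peter", "pedro"))
```

Because character order is ignored, two anagrams such as "abc" and "cba"
have a bag distance of zero, so a comparison of this kind is best suited
as a fast filter ahead of a more expensive string comparison.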
The authors would be grateful if users of Febrl would inform
us (by e-mail) of how they have used the system. We are particularly
interested in references to scientific papers or reports which mention
or cite Febrl (please see Citing Febrl below).
Citing Febrl
If you want to refer to Febrl in a publication, please cite
our PAKDD-2004 paper Febrl - A Parallel Open Source Data Linkage
System. The full citation is:
Febrl - A Parallel Open Source Data Linkage System
Peter Christen, Tim Churches and Markus Hegland
Proceedings of the 8th Pacific-Asia Conference, PAKDD 2004, Sydney,
Australia, May 26-28, 2004. Pages 638 - 647.
Springer Lecture Notes in Artificial Intelligence, Volume 3056.
This document is subject to the ANUOS License Version 1.2 (the
License, see Appendix H of this document);
you may not use this document except in compliance with the License.
All Febrl computer program code and associated data files
and documentation, including this document, are distributed under the
License on an AS IS basis, WITHOUT WARRANTY OF ANY KIND,
either express or implied. See the License for the specific language
governing rights and limitations under the License.