Febrl - Freely extensible biomedical record linkage

4. System Overview
The Febrl system is implemented in an object-oriented design
with a handful of modules, each containing routines for specific
tasks. The overall system is configured and controlled by a
project.py module, which will be explained
in detail in Chapter 5.
Record linkage consists of two main steps: the first deals with
data cleaning and standardisation, while the second performs the
actual linkage (or deduplication). The user therefore needs to
specify various settings in order to perform a
cleaning/standardisation and/or a linkage/deduplication/geocoding
process.
- Definition of data sets
The two types of data sets needed by Febrl
are input data sets (i.e. the raw and possibly uncleaned original
data) and temporary data sets (which hold cleaned and standardised
records before they are linked or deduplicated).
Data sets are described in detail in
Chapter 13, and a small illustrative sketch follows this list.
- Definition of look-up tables
Various look-up tables are needed in the cleaning (correction
lists), standardisation (tagging look-up tables) and linkage
and deduplication (frequency and geocode look-up tables)
processes. Chapter 14 describes the
various types of look-up tables available and their file formats,
and shows how to initialise and load them. A correction list and
a tagging look-up table are sketched after this list.
- Definition of hidden Markov models (HMMs)
For name and address segmentation within the data
standardisation process, HMMs can be used efficiently instead of
a traditional rule-based approach. The application of HMMs to
data standardisation in general is described in
Chapter 7, while
Chapter 8 explains the training process
for HMMs as used within Febrl in more detail. A toy segmentation
example follows this list.
- Definitions for date standardisation
For the standardisation of dates, special date format strings
need to be given, as described in
Section 6.6 (see the sketch after this list).
- Definition of standardisation processes
This includes which fields from the input data sets should be
cleaned and standardised, what methods should be used, and how the
resulting cleaned and standardised records should be written into
temporary or output data sets.
Chapter 6 deals with these issues in detail.
- Definition of blocking indexes
To reduce the huge number of possible record pair comparisons,
indexing methods (such as blocking or sorting) are used.
These methods are described in Section 9.1; a simple blocking
sketch is given after this list.
- Definition of field comparison functions
In the record linkage and deduplication processes, record pairs
are compared field by field using one of the available field
comparison functions, which are described in
Section 9.2. An approximate string comparison sketch follows this list.
- Definition of a classifier
The weight vectors resulting from record pair comparisons need to be
classified as links, non-links or possible links,
using one of the available classifiers.
Section 9.5 describes these
classifiers, and a simple threshold-based sketch follows this list.
- Definition of logging and verbose output
The Febrl system can write log information
to a file and display it on standard output at various levels of
detail. Furthermore, warning and error messages generated by
the system can be logged and displayed as well. The setting up
of a project logger (which should be done at the beginning of a
project.py module) is described in detail in
Chapter 15. A generic logging sketch is given after this list.
- Definition of output forms
Several output forms are possible with the current version of
Febrl, including displaying a histogram and detailed
views of record pairs, as well as saving record pairs into text
files. Chapter 11 deals with these issues (a small histogram
sketch follows this list).
- Setting of output assignment restrictions
After the classification of record pairs, Febrl allows
a one-to-one assignment procedure to be applied
which finds an optimal assignment of record pairs.
Section 11.1 deals with this topic in more
detail; a simple greedy sketch follows this list.
- Starting a standardisation, linkage or deduplication process
Once all the necessary settings have been made, it is very simple to
define and start a standardisation, linkage or deduplication
process, as shown in Section 9.6.
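
The short Python sketches below illustrate, in plain standard-library
Python, some of the steps listed above. They are simplified
illustrations only and do not use Febrl's actual module interfaces,
which are described in the chapters referenced above. The first sketch
shows the two kinds of data sets as simple in-memory structures: a raw
input data set read with the csv module, and a temporary data set
holding the cleaned records. All field names and values are invented
for the example.

  import csv
  import io

  # A small in-memory stand-in for a raw input data set; a real input
  # data set would normally be a CSV file on disk.
  raw_csv = io.StringIO(
      "rec_id,given_name,surname,dob\n"
      "1,  JOHN ,smith,1968-07-25\n"
      "2,mary-anne,  O'Brian ,25/07/1968\n"
  )

  # Each raw record becomes a dictionary keyed by field name.
  input_records = list(csv.DictReader(raw_csv))

  # A temporary data set can simply be another list of dictionaries that
  # receives the cleaned and standardised versions of the input records.
  temporary_records = [
      {field: value.strip().lower() for field, value in record.items()}
      for record in input_records
  ]

  print(temporary_records[0])
  # {'rec_id': '1', 'given_name': 'john', 'surname': 'smith', 'dob': '1968-07-25'}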
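The next sketch mimics, in a very reduced form, what a correction list
and a tagging look-up table do during cleaning and standardisation:
misspellings and abbreviations are replaced by canonical words, and
known words receive a tag. The entries and tag codes used here are
invented for the example and are not Febrl's actual tables.

  # A correction list maps frequent variants onto a single canonical
  # form; a tagging look-up table assigns a tag to known words.
  correction_list = {
      "rd.": "road",
      "rd": "road",
      "st.": "street",
      "st": "street",
  }

  tag_lookup = {
      "road": "WT",     # wayfare type (invented tag code)
      "street": "WT",
      "main": "WN",     # wayfare name (invented tag code)
  }

  def clean_and_tag(value):
      """Lower-case a field, apply the correction list word by word,
      and attach a tag to every word found in the tagging look-up table."""
      words = value.lower().split()
      corrected = [correction_list.get(w, w) for w in words]
      tags = [tag_lookup.get(w, "UN") for w in corrected]  # 'UN' = unknown
      return corrected, tags

  print(clean_and_tag("42 Main St."))
  # (['42', 'main', 'street'], ['UN', 'WN', 'WT'])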
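For the HMM-based segmentation, the following sketch runs the Viterbi
algorithm over a sequence of word tags with a tiny, hand-made
three-state name model. All states, tags and probabilities are invented
for illustration; real models are trained as described in Chapter 8.

  # Each observation is the tag of a word ('TI' title, 'GN' given name
  # word, 'SN' surname word); Viterbi picks the most likely state
  # sequence, which is the segmentation of the name.
  states = ["title", "givenname", "surname"]
  start_prob = {"title": 0.4, "givenname": 0.5, "surname": 0.1}
  trans_prob = {
      "title":     {"title": 0.05, "givenname": 0.9,  "surname": 0.05},
      "givenname": {"title": 0.01, "givenname": 0.29, "surname": 0.7},
      "surname":   {"title": 0.01, "givenname": 0.09, "surname": 0.9},
  }
  emit_prob = {
      "title":     {"TI": 0.9,  "GN": 0.05, "SN": 0.05},
      "givenname": {"TI": 0.05, "GN": 0.8,  "SN": 0.15},
      "surname":   {"TI": 0.05, "GN": 0.25, "SN": 0.7},
  }

  def viterbi(observations):
      """Return the most probable state sequence for a list of tags."""
      # Probabilities and partial paths after the first observation.
      prob = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
      paths = {s: [s] for s in states}
      for obs in observations[1:]:
          new_prob, new_paths = {}, {}
          for s in states:
              best_prev = max(states, key=lambda p: prob[p] * trans_prob[p][s])
              new_prob[s] = prob[best_prev] * trans_prob[best_prev][s] * emit_prob[s][obs]
              new_paths[s] = paths[best_prev] + [s]
          prob, paths = new_prob, new_paths
      best_final = max(states, key=lambda s: prob[s])
      return paths[best_final]

  print(viterbi(["TI", "GN", "SN"]))  # ['title', 'givenname', 'surname']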
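Date standardisation can be illustrated with a list of candidate format
strings that are tried in turn until one of them parses the input
value. The format strings below are examples only; the strings actually
accepted by Febrl are documented in Section 6.6.

  from datetime import datetime

  # Candidate format strings, tried in order; the first that parses wins.
  date_formats = ["%Y-%m-%d", "%d/%m/%Y", "%d %b %Y", "%d%m%Y"]

  def standardise_date(value):
      """Return a (day, month, year) triple, or None if no format matches."""
      for fmt in date_formats:
          try:
              parsed = datetime.strptime(value.strip(), fmt)
              return (parsed.day, parsed.month, parsed.year)
          except ValueError:
              continue
      return None

  print(standardise_date("25/07/1968"))   # (25, 7, 1968)
  print(standardise_date("1968-07-25"))   # (25, 7, 1968)
  print(standardise_date("not a date"))   # None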
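The idea behind a blocking index is sketched below: records are grouped
by a blocking key (here, hypothetically, the postcode plus the first
two letters of the surname), and candidate pairs are only formed within
each group instead of across the whole data set.

  from collections import defaultdict
  from itertools import combinations

  records = [
      {"id": 1, "surname": "smith",  "postcode": "2600"},
      {"id": 2, "surname": "smyth",  "postcode": "2600"},
      {"id": 3, "surname": "miller", "postcode": "2914"},
  ]

  def blocking_key(rec):
      # Example blocking definition: postcode plus first two surname letters.
      return rec["postcode"] + rec["surname"][:2]

  # Group the records into blocks according to their blocking key.
  blocks = defaultdict(list)
  for rec in records:
      blocks[blocking_key(rec)].append(rec)

  # Candidate pairs are only built inside each block.
  candidate_pairs = []
  for block in blocks.values():
      candidate_pairs.extend(combinations(block, 2))

  print(len(candidate_pairs))  # 1 candidate pair instead of 3 possible pairs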
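A field comparison function takes two field values and returns a
similarity. The sketch below uses a plain Levenshtein edit distance
normalised to the range 0.0 to 1.0; the comparison functions actually
available in Febrl are described in Section 9.2.

  def edit_distance(s1, s2):
      """Classic dynamic-programming (Levenshtein) edit distance."""
      prev = list(range(len(s2) + 1))
      for i, c1 in enumerate(s1, start=1):
          curr = [i]
          for j, c2 in enumerate(s2, start=1):
              cost = 0 if c1 == c2 else 1
              curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
          prev = curr
      return prev[-1]

  def string_similarity(s1, s2):
      """Return a similarity between 0.0 (different) and 1.0 (identical)."""
      if not s1 and not s2:
          return 1.0
      return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))

  print(string_similarity("smith", "smyth"))   # 0.8
  print(string_similarity("smith", "miller"))  # a lower similarity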
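Classification of the resulting weight vectors can be illustrated with
a simple two-threshold rule in the spirit of the classical Fellegi and
Sunter approach: the field similarities are summed and the total is
compared against a lower and an upper threshold. The threshold values
below are placeholders; the classifiers actually provided by Febrl are
described in Section 9.5.

  # Illustrative threshold values only.
  LOWER_THRESHOLD = 1.0
  UPPER_THRESHOLD = 2.5

  def classify(weight_vector):
      """Classify a weight vector as 'link', 'non-link' or 'possible link'."""
      total = sum(weight_vector)
      if total >= UPPER_THRESHOLD:
          return "link"
      if total <= LOWER_THRESHOLD:
          return "non-link"
      return "possible link"

  print(classify([0.8, 1.0, 0.9]))  # link
  print(classify([0.8, 0.5, 0.4]))  # possible link
  print(classify([0.1, 0.2, 0.0]))  # non-link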
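Writing log information both to a file and to standard output, at
different verbosity levels, can be done with Python's standard logging
module, roughly as follows. The logger name, file name and level
settings are arbitrary; Febrl's own project logger is set up as
described in Chapter 15.

  import logging

  # Send log messages both to a file and to standard output; the
  # verbosity of each handler can be set independently.
  logger = logging.getLogger("myproject")
  logger.setLevel(logging.DEBUG)

  file_handler = logging.FileHandler("myproject.log")
  file_handler.setLevel(logging.INFO)

  console_handler = logging.StreamHandler()
  console_handler.setLevel(logging.WARNING)

  formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
  file_handler.setFormatter(formatter)
  console_handler.setFormatter(formatter)

  logger.addHandler(file_handler)
  logger.addHandler(console_handler)

  logger.info("standardisation started")        # written to the file only
  logger.warning("unparsable date encountered")  # written to file and console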
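A histogram of matching weights, as one possible output form, can be
produced in its simplest text form by bucketing the weights into bins.
The weight values below are invented for the example.

  from collections import Counter

  # Matching weights of classified record pairs (illustrative values only).
  weights = [0.3, 0.4, 1.2, 1.3, 1.4, 2.6, 2.7, 2.8, 2.9]

  # Bucket the weights into unit-wide bins and print a simple text histogram.
  bins = Counter(int(w) for w in weights)
  for bin_start in sorted(bins):
      print(f"[{bin_start}.0, {bin_start + 1}.0): " + "*" * bins[bin_start])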
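Finally, the effect of a one-to-one assignment restriction is sketched
below with a simple greedy pass over the classified pairs in decreasing
weight order. This greedy version is only an approximation; the
procedure Febrl uses to find optimal assignments is described in
Section 11.1.

  # Classified record pairs as (record_a, record_b, matching_weight) triples.
  pairs = [
      ("a1", "b1", 2.9),
      ("a1", "b2", 2.4),
      ("a2", "b1", 2.2),
      ("a2", "b3", 1.8),
  ]

  # A greedy pass in decreasing weight order ensures that every record
  # from either data set ends up in at most one accepted link.
  used_a, used_b, assignment = set(), set(), []
  for rec_a, rec_b, weight in sorted(pairs, key=lambda p: p[2], reverse=True):
      if rec_a not in used_a and rec_b not in used_b:
          assignment.append((rec_a, rec_b, weight))
          used_a.add(rec_a)
          used_b.add(rec_b)

  print(assignment)  # [('a1', 'b1', 2.9), ('a2', 'b3', 1.8)]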