Febrl - Freely extensible biomedical record linkage

4. System Overview
The Febrl system is implemented in an object-oriented design
with a handful of modules, each containing routines for specific
tasks. The overall system is configured and controlled by a
project.py module, which will be explained
in detail in Chapter 5.
Record linkage consists of two main steps: the first deals with
data cleaning and standardisation, while the second performs the
actual linkage (or deduplication). The user therefore needs to
specify various settings in order to perform a
cleaning/standardisation and/or a linkage/deduplication/geocoding
process.
- Definition of data sets
The two types of data sets needed by Febrl
are input data sets (i.e. the raw and possibly uncleaned original
data) and temporary data sets (which hold cleaned and standardised
records before they are linked or deduplicated).
Data sets are described in detail in
Chapter 13, and a small illustrative sketch follows this list.
- Definition of look-up tables
Various look-up tables are needed in the cleaning (correction
lists), standardisation (tagging look-up tables) and linkage
and deduplication (frequency and geocode look-up tables)
processes. Chapter 14 describes the
various types of look-up tables available and their file formats,
and shows how to initialise and load them. A correction list and
a tagging look-up table are sketched after this list.
- Definition of hidden Markov models (HMMs)
For name and address segmentation within the data
standardisation process, HMMs can be used efficiently instead of
a traditional rule-based approach. The application of HMMs to
data standardisation in general is described in
Chapter 7, while
Chapter 8 explains the training process
for HMMs as used within Febrl in more detail. A toy segmentation
example follows this list.
- Definitions for date standardisation
For the standardisation of dates, special date format strings
need to be given, as described in
Section 6.6 (see the sketch after this list).
- Definition of standardisation processes
This includes which fields from the input data sets should be
cleaned and standardised, what methods should be used, and how the
resulting cleaned and standardised records should be written into
temporary or output data sets.
Chapter 6 deals with these issues in detail.
- Definition of blocking indexes
To reduce the huge number of possible record pair comparisons,
indexing methods (such as blocking or sorting) are used.
These methods are described in Section 9.1; a simple blocking
sketch is given after this list.
- Definition of field comparison functions
In the record linkage and deduplication processes, record pairs
are compared field by field using one of the available field
comparison functions, which are described in
Section 9.2. An approximate string comparison sketch follows this list.
- Definition of a classifier
The weight vectors resulting from record pair comparisons need to be
classified as links, non-links or possible links,
using one of the available classifiers.
Section 9.5 describes these
classifiers, and a simple threshold-based sketch follows this list.
- Definition of logging and verbose output
The Febrl system can write log information
to a file and display it on standard output at various levels of
detail. Furthermore, warning and error messages generated by
the system can be logged and displayed as well. The setting up
of a project logger (which should be done at the beginning of a
project.py module) is described in detail in
Chapter 15. A generic logging sketch is given after this list.
- Definition of output forms
Several output forms are possible with the current version of
Febrl, including displaying a histogram and detailed
views of record pairs, as well as saving record pairs into text
files. Chapter 11 deals with these issues (a small histogram
sketch follows this list).
- Setting of output assignment restrictions
After the classification of record pairs, Febrl allows
a one-to-one assignment procedure to be applied
which finds an optimal assignment of record pairs.
Section 11.1 deals with this topic in more
detail; a simple greedy sketch follows this list.
- Starting a standardisation, linkage or deduplication process
Once all the necessary settings have been made, it is very simple to
define and start a standardisation, linkage or deduplication
process, as shown in Section 9.6.
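
The short Python sketches below illustrate, in plain standard-library
Python, some of the steps listed above. They are simplified
illustrations only and do not use Febrl's actual module interfaces,
which are described in the chapters referenced above. The first sketch
shows the two kinds of data sets as simple in-memory structures: a raw
input data set read with the csv module, and a temporary data set
holding the cleaned records. All field names and values are invented
for the example.

  import csv
  import io

  # A small in-memory stand-in for a raw input data set; a real input
  # data set would normally be a CSV file on disk.
  raw_csv = io.StringIO(
      "rec_id,given_name,surname,dob\n"
      "1,  JOHN ,smith,1968-07-25\n"
      "2,mary-anne,  O'Brian ,25/07/1968\n"
  )

  # Each raw record becomes a dictionary keyed by field name.
  input_records = list(csv.DictReader(raw_csv))

  # A temporary data set can simply be another list of dictionaries that
  # receives the cleaned and standardised versions of the input records.
  temporary_records = [
      {field: value.strip().lower() for field, value in record.items()}
      for record in input_records
  ]

  print(temporary_records[0])
  # {'rec_id': '1', 'given_name': 'john', 'surname': 'smith', 'dob': '1968-07-25'}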
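The next sketch mimics, in a very reduced form, what a correction list
and a tagging look-up table do during cleaning and standardisation:
misspellings and abbreviations are replaced by canonical words, and
known words receive a tag. The entries and tag codes used here are
invented for the example and are not Febrl's actual tables.

  # A correction list maps frequent variants onto a single canonical
  # form; a tagging look-up table assigns a tag to known words.
  correction_list = {
      "rd.": "road",
      "rd": "road",
      "st.": "street",
      "st": "street",
  }

  tag_lookup = {
      "road": "WT",     # wayfare type (invented tag code)
      "street": "WT",
      "main": "WN",     # wayfare name (invented tag code)
  }

  def clean_and_tag(value):
      """Lower-case a field, apply the correction list word by word,
      and attach a tag to every word found in the tagging look-up table."""
      words = value.lower().split()
      corrected = [correction_list.get(w, w) for w in words]
      tags = [tag_lookup.get(w, "UN") for w in corrected]  # 'UN' = unknown
      return corrected, tags

  print(clean_and_tag("42 Main St."))
  # (['42', 'main', 'street'], ['UN', 'WN', 'WT'])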
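For the HMM-based segmentation, the following sketch runs the Viterbi
algorithm over a sequence of word tags with a tiny, hand-made
three-state name model. All states, tags and probabilities are invented
for illustration; real models are trained as described in Chapter 8.

  # Each observation is the tag of a word ('TI' title, 'GN' given name
  # word, 'SN' surname word); Viterbi picks the most likely state
  # sequence, which is the segmentation of the name.
  states = ["title", "givenname", "surname"]
  start_prob = {"title": 0.4, "givenname": 0.5, "surname": 0.1}
  trans_prob = {
      "title":     {"title": 0.05, "givenname": 0.9,  "surname": 0.05},
      "givenname": {"title": 0.01, "givenname": 0.29, "surname": 0.7},
      "surname":   {"title": 0.01, "givenname": 0.09, "surname": 0.9},
  }
  emit_prob = {
      "title":     {"TI": 0.9,  "GN": 0.05, "SN": 0.05},
      "givenname": {"TI": 0.05, "GN": 0.8,  "SN": 0.15},
      "surname":   {"TI": 0.05, "GN": 0.25, "SN": 0.7},
  }

  def viterbi(observations):
      """Return the most probable state sequence for a list of tags."""
      # Probabilities and partial paths after the first observation.
      prob = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
      paths = {s: [s] for s in states}
      for obs in observations[1:]:
          new_prob, new_paths = {}, {}
          for s in states:
              best_prev = max(states, key=lambda p: prob[p] * trans_prob[p][s])
              new_prob[s] = prob[best_prev] * trans_prob[best_prev][s] * emit_prob[s][obs]
              new_paths[s] = paths[best_prev] + [s]
          prob, paths = new_prob, new_paths
      best_final = max(states, key=lambda s: prob[s])
      return paths[best_final]

  print(viterbi(["TI", "GN", "SN"]))  # ['title', 'givenname', 'surname']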
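Date standardisation can be illustrated with a list of candidate format
strings that are tried in turn until one of them parses the input
value. The format strings below are examples only; the strings actually
accepted by Febrl are documented in Section 6.6.

  from datetime import datetime

  # Candidate format strings, tried in order; the first that parses wins.
  date_formats = ["%Y-%m-%d", "%d/%m/%Y", "%d %b %Y", "%d%m%Y"]

  def standardise_date(value):
      """Return a (day, month, year) triple, or None if no format matches."""
      for fmt in date_formats:
          try:
              parsed = datetime.strptime(value.strip(), fmt)
              return (parsed.day, parsed.month, parsed.year)
          except ValueError:
              continue
      return None

  print(standardise_date("25/07/1968"))   # (25, 7, 1968)
  print(standardise_date("1968-07-25"))   # (25, 7, 1968)
  print(standardise_date("not a date"))   # None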
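The idea behind a blocking index is sketched below: records are grouped
by a blocking key (here, hypothetically, the postcode plus the first
two letters of the surname), and candidate pairs are only formed within
each group instead of across the whole data set.

  from collections import defaultdict
  from itertools import combinations

  records = [
      {"id": 1, "surname": "smith",  "postcode": "2600"},
      {"id": 2, "surname": "smyth",  "postcode": "2600"},
      {"id": 3, "surname": "miller", "postcode": "2914"},
  ]

  def blocking_key(rec):
      # Example blocking definition: postcode plus first two surname letters.
      return rec["postcode"] + rec["surname"][:2]

  # Group the records into blocks according to their blocking key.
  blocks = defaultdict(list)
  for rec in records:
      blocks[blocking_key(rec)].append(rec)

  # Candidate pairs are only built inside each block.
  candidate_pairs = []
  for block in blocks.values():
      candidate_pairs.extend(combinations(block, 2))

  print(len(candidate_pairs))  # 1 candidate pair instead of 3 possible pairs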
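A field comparison function takes two field values and returns a
similarity. The sketch below uses a plain Levenshtein edit distance
normalised to the range 0.0 to 1.0; the comparison functions actually
available in Febrl are described in Section 9.2.

  def edit_distance(s1, s2):
      """Classic dynamic-programming (Levenshtein) edit distance."""
      prev = list(range(len(s2) + 1))
      for i, c1 in enumerate(s1, start=1):
          curr = [i]
          for j, c2 in enumerate(s2, start=1):
              cost = 0 if c1 == c2 else 1
              curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
          prev = curr
      return prev[-1]

  def string_similarity(s1, s2):
      """Return a similarity between 0.0 (different) and 1.0 (identical)."""
      if not s1 and not s2:
          return 1.0
      return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))

  print(string_similarity("smith", "smyth"))   # 0.8
  print(string_similarity("smith", "miller"))  # a lower similarity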
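Classification of the resulting weight vectors can be illustrated with
a simple two-threshold rule in the spirit of the classical Fellegi and
Sunter approach: the field similarities are summed and the total is
compared against a lower and an upper threshold. The threshold values
below are placeholders; the classifiers actually provided by Febrl are
described in Section 9.5.

  # Illustrative threshold values only.
  LOWER_THRESHOLD = 1.0
  UPPER_THRESHOLD = 2.5

  def classify(weight_vector):
      """Classify a weight vector as 'link', 'non-link' or 'possible link'."""
      total = sum(weight_vector)
      if total >= UPPER_THRESHOLD:
          return "link"
      if total <= LOWER_THRESHOLD:
          return "non-link"
      return "possible link"

  print(classify([0.8, 1.0, 0.9]))  # link
  print(classify([0.8, 0.5, 0.4]))  # possible link
  print(classify([0.1, 0.2, 0.0]))  # non-link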
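Writing log information both to a file and to standard output, at
different verbosity levels, can be done with Python's standard logging
module, roughly as follows. The logger name, file name and level
settings are arbitrary; Febrl's own project logger is set up as
described in Chapter 15.

  import logging

  # Send log messages both to a file and to standard output; the
  # verbosity of each handler can be set independently.
  logger = logging.getLogger("myproject")
  logger.setLevel(logging.DEBUG)

  file_handler = logging.FileHandler("myproject.log")
  file_handler.setLevel(logging.INFO)

  console_handler = logging.StreamHandler()
  console_handler.setLevel(logging.WARNING)

  formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
  file_handler.setFormatter(formatter)
  console_handler.setFormatter(formatter)

  logger.addHandler(file_handler)
  logger.addHandler(console_handler)

  logger.info("standardisation started")        # written to the file only
  logger.warning("unparsable date encountered")  # written to file and console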
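A histogram of matching weights, as one possible output form, can be
produced in its simplest text form by bucketing the weights into bins.
The weight values below are invented for the example.

  from collections import Counter

  # Matching weights of classified record pairs (illustrative values only).
  weights = [0.3, 0.4, 1.2, 1.3, 1.4, 2.6, 2.7, 2.8, 2.9]

  # Bucket the weights into unit-wide bins and print a simple text histogram.
  bins = Counter(int(w) for w in weights)
  for bin_start in sorted(bins):
      print(f"[{bin_start}.0, {bin_start + 1}.0): " + "*" * bins[bin_start])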
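Finally, the effect of a one-to-one assignment restriction is sketched
below with a simple greedy pass over the classified pairs in decreasing
weight order. This greedy version is only an approximation; the
procedure Febrl uses to find optimal assignments is described in
Section 11.1.

  # Classified record pairs as (record_a, record_b, matching_weight) triples.
  pairs = [
      ("a1", "b1", 2.9),
      ("a1", "b2", 2.4),
      ("a2", "b1", 2.2),
      ("a2", "b3", 1.8),
  ]

  # A greedy pass in decreasing weight order ensures that every record
  # from either data set ends up in at most one accepted link.
  used_a, used_b, assignment = set(), set(), []
  for rec_a, rec_b, weight in sorted(pairs, key=lambda p: p[2], reverse=True):
      if rec_a not in used_a and rec_b not in used_b:
          assignment.append((rec_a, rec_b, weight))
          used_a.add(rec_a)
          used_b.add(rec_b)

  print(assignment)  # [('a1', 'b1', 2.9), ('a2', 'b3', 1.8)]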