5. Configuration and Running Febrl using a Module derived from 'project.py'

All aspects of the Febrl system are configured and controlled by a single Python module (program), which can be derived from one of the supplied modules project-standardise.py, project-linkage.py or project-deduplicate.py. In this section the complete project-deduplicate.py module, as supplied with the current Febrl distribution, is described in detail using blocks of code extracted from it. The module project-linkage.py is very similar in structure, the main difference being that two data sets are dealt with, while project-standardise.py only contains the definitions for one input and one output data set plus the necessary standardisers, but no linkage or deduplication process. Each code block is explained and references to the relevant chapters are given.

It is assumed that the reader has some familiarity with the (very simple) syntax of the Python programming language in which Febrl is implemented. If not, the necessary knowledge can be gained in just a few hours from one of the tutorials or introductions listed on the Python Web site at http://www.python.org. Note that comments in Python start with a hash character (#) and continue until the end of the line. Alternatively, a comment can span several lines if it is enclosed in triple quotes ("""), as shown in the code below for the module documentation string.

At the top of the project-deduplicate.py module is the header with version and licensing information, followed by a documentation string that gives the name of the module and a short description.

# ====================================================================
# The contents of this file are subject to the ANUOS License Version
# 1.2 (the "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at:
#   http://datamining.anu.edu.au/linkage.html
# Software distributed under the License is distributed on an "AS IS"
# basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
# the License for the specific language governing rights and 
# limitations under the License.
# The Original Software is: "project-deduplicate.py"
# The Initial Developers of the Original Software are:
#   Dr Tim Churches (Centre for Epidemiology and Research, New South
#                   Wales Department of Health)
#   Dr Peter Christen (Department of Computer Science, Australian
#                      National University)
# Copyright (C) 2002 - 2005 the Australian National University and
# others. All Rights Reserved.
# Contributors:
# Alternatively, the contents of this file may be used under the terms
# of the GNU General Public License Version 2 or later (the "GPL"), in
# which case the provisions of the GPL are applicable instead of those
# above. The GPL is available at the following URL: http://www.gnu.org
# If you wish to allow use of your version of this file only under the
# terms of the GPL, and not to allow others to use your version of
# this file under the terms of the ANUOS License, indicate your
# decision by deleting the provisions above and replace them with the
# notice and other provisions required by the GPL. If you do not
# delete the provisions above, a recipient may use your version of
# this file under the terms of any one of the ANUOS License or the GPL.
# ====================================================================
# Freely extensible biomedical record linkage (Febrl) - Version 0.3
# See: http://datamining.anu.edu.au/linkage.html
# ====================================================================
"""Module project-deduplicate.py - Configuration for a deduplication

   Briefly, what needs to be defined for a deduplication project is:
   - A Febrl object, a project, plus a project logger
   - One input data set
   - One corresponding temporary data set (with 'readwrite' access)
   - Lookup tables to be used
   - Standardisers for names, addresses and dates
   - Field comparator functions and a record comparator
   - A blocking index
   - A classifier

   and then the 'deduplicate' method can be called.

   This project module will standardise and then deduplicate the
   example data set 'dataset2.csv' given in the 'dsgen' directory.

   The directory separator 'dirsep' is a shorthand for os.sep as
   defined in febrl.py.
"""

In the following code block all the required Febrl and standard Python modules are imported so that the necessary functionality is available.

Note: The character used to separate directories (or folders) differs between operating systems. For example, Unix systems use '/', while Windows/MSDOS uses '\'. From version 0.3 onwards Febrl contains a specifier 'dirsep' which is set to the operating system's directory separator and imported when the febrl module is imported. The 'dirsep' specifier is used in the following code blocks each time a file in a sub-directory of the main Febrl directory is accessed.
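The effect of 'dirsep' can be illustrated with a short sketch. Note that the real febrl module defines 'dirsep' for you; the assignment below only shows what it amounts to:

```python
import os

# 'dirsep' in febrl.py is simply the platform's directory separator
# (this assignment is a sketch of what the febrl module provides):
dirsep = os.sep  # '/' on Unix, '\\' on Windows

# Building the example input file name used further below:
file_name = 'dsgen' + dirsep + 'dataset2.csv'
```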

# ====================================================================
# Imports go here

import sys                    # Python system modules needed
import time

from febrl import *            # Main Febrl classes
from dataset import *          # Data set routines
from standardisation import *  # Standardisation routines
from comparison import *       # Comparison functions
from lookup import *           # Look-up table routines
from indexing import *         # Indexing and blocking routines
from simplehmm import *        # Hidden Markov model (HMM) routines
from classification import *   # Classifiers for weight vectors

The first thing that should be done is to initialise a project logger, which defines how logging information is displayed in the terminal window (or console) and in a log file. The console and file levels can be set independently from each other to values 'NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR' or 'CRITICAL' (see the Python standard module logging.py for more details about the Python logging system). Similar to the parallel_write argument discussed below, parallel_output (which can be set to 'host' or 'all') defines the way output is logged and displayed. For more details on console output and logging please see Chapter 15.

# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name = 'febrl-example-dedup.log',
                     file_level = 'WARN',
                  console_level = 'INFO',
                      clear_log = True,
                parallel_output = 'host')

Next the system is initialised by creating a Febrl object myfebrl, and a new project is initialised as part of it. This project is assigned a name, a description and a file name (into which it can be saved at any time). Febrl loads records blockwise from the input data sets, and the argument block_size controls how many records are loaded in one such block.

Note that you can use names other than myfebrl and myproject if you wish - just be sure to use the same name throughout. In fact, you can configure multiple Febrl projects from one project.py module if you wish, using different variable names to refer to each project object. For the sake of simplicity, we will only configure one project here.

# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

myfebrl = Febrl(description = 'Example Febrl instance',
                 febrl_path = '.')

myproject = myfebrl.new_project(name = 'example-dedup',
                         description = 'Deduplicate example data set',
                           file_name = 'example-deduplicate.fbr',
                          block_size = 100,
                      parallel_write = 'host')

The argument block_size sets the number of records that are loaded and processed at a time, because Febrl works through data sets in a blockwise fashion: records are loaded from files in chunks (blocks) and processed before the next block is loaded. Note that this block size has nothing to do with the blocking used within the record linkage process itself.

The parallel_write argument sets the way in which data sets are written to files when Febrl is run in parallel. If set to 'host' only the host process (the one Febrl has been started on) writes into data sets. On the other hand, if set to 'all' then all processes write into local files. See Chapter 13 for more detailed information on this topic.

The next two code blocks define an input data set (initialised for reading from a CSV file) and a temporary memory based data set (to hold the cleaned and standardised records before they are deduplicated or linked and written into an output data set). See Chapter 13 for more information on the given arguments, and how to access data sets. Note that this initialisation of data sets does not mean that they are immediately loaded, but only that preparations are made to access them. Note also that the fields defined in the input data set need not cover all the columns in the data file or database table being accessed (for example if some columns are not needed in a standardisation and linkage process).

Note: While in this example the temporary data set is a memory based data set, for larger projects (with data sets containing more than a couple of thousand records) it is recommended that a Shelve data set (file based) is used as temporary data set (see Chapter 13 for more information on this topic).

# ====================================================================
# Define your original input data set(s)
# Only one data set is needed for deduplication

indata = DataSetCSV(name = 'example2in',
             description = 'Example data set 2',
             access_mode = 'read',
            header_lines = 1,
               file_name = 'dsgen'+dirsep+'dataset2.csv',
                  fields = {'rec_id':0,
                            # ... (further field column definitions
                            # not shown here)
                           },
          fields_default = '',
            strip_fields = True,
          missing_values = ['','missing'])
# ====================================================================
# Define a temporary data set

tmpdata = DataSetMemory(name = 'example2tmp',
                 description = 'Temporary example 2 data set',
                 access_mode = 'readwrite',
                      fields = {'title':1,
                                # ... (further output field definitions
                                # not shown here)
                                # The following are fields that are
                                # passed without standardisation, and
                                # the last output field contains the
                                # probability of the address HMM
                               },
              missing_values = ['','missing'])
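As noted above, for larger data sets a file based Shelve data set is recommended as the temporary data set instead of the DataSetMemory definition shown. A possible definition might look as follows; this is a sketch only, and the exact DataSetShelve arguments (in particular the 'file_name' argument assumed here) should be checked against Chapter 13:

```python
# Sketch: a file based temporary data set for larger projects, used
# in place of the DataSetMemory definition above. The 'file_name'
# argument is an assumption; see Chapter 13 for the exact interface.
tmpdata = DataSetShelve(name = 'example2tmp',
                 description = 'Temporary example 2 data set',
                 access_mode = 'readwrite',
                   file_name = 'example2tmp.slv',
                      fields = {'title':1},
              missing_values = ['','missing'])
```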

Various types of look-up tables are loaded from their files in the next code block. Note that, for example, several name tagging look-up table files are loaded into one name tagging look-up table. More details on the file format of these look-up tables and correction lists can be found in Chapter 14.

# ====================================================================
# Define and load lookup tables

name_lookup_table = TagLookupTable(name = 'Name lookup table',
                                default = '')

name_correction_list = CorrectionList(name = 'Name correction list')

surname_freq_table = FrequencyLookupTable(name = 'Surname freq table',
                                       default = 1)

address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                   default = '')

addr_correction_list = CorrectionList(name = 'Address corr. list')

pc_geocode_table = GeocodeLookupTable(name = 'NSW postcode locations',
                                   default = [])

In the following code block two hidden Markov models (HMMs) are defined (first their states and then the observations, i.e. tags) and loaded from files. These HMMs are used by the name and address component standardisation processes. Further information on HMMs is given in Chapters 7 and 8.

# ====================================================================
# Define and load hidden Markov models (HMMs)

name_states = ['titl','baby','knwn','andor','gname1','gname2','ghyph',
               # ... (further name HMM states not shown)
              ]
name_tags = ['NU','AN','TI','PR','GF','GM','SN','ST','SP','HY','CO',
             # ... (further name tags not shown)
            ]

myname_hmm = hmm('Name HMM', name_states, name_tags)

address_states = ['wfnu','wfna1','wfna2','wfql','wfty','unnu','unty',
                  # ... (further address HMM states not shown)
                 ]

address_tags = ['PC','N4','NU','AN','TR','CR','LN','ST','IN','IT',
                # ... (further address tags not shown)
               ]

myaddress_hmm = hmm('Address HMM', address_states, address_tags)

Next follows a list of possible date formats needed to parse date strings in the data standardisation process. See Section 6.6 for more details on these format strings.

# ====================================================================
# Define a list of date parsing format strings

date_parse_formats = ['%d %m %Y',   # 24 04 2002  or  24 4 2002
                      '%d %B %Y',   # 24 Apr 2002 or  24 April 2002
                      '%m %d %Y',   # 04 24 2002  or  4 24 2002
                      '%B %d %Y',   # Apr 24 2002 or  April 24 2002
                      '%Y %m %d',   # 2002 04 24  or  2002 4 24
                      '%Y %B %d',   # 2002 Apr 24 or  2002 April 24
                      '%Y%m%d',     # 20020424  ** ISO standard **
                      '%d%m%Y',     # 24042002
                      '%m%d%Y',     # 04242002
                      '%d %m %y',   # 24 04 02    or  24 4 02
                      '%d %B %y',   # 24 Apr 02   or  24 April 02
                      '%y %m %d',   # 02 04 24    or  02 4 24
                      '%y %B %d',   # 02 Apr 24   or  02 April 24
                      '%m %d %y',   # 04 24 02    or  4 24 02
                      '%B %d %y',   # Apr 24 02   or  April 24 02
                      '%y%m%d',     # 020424
                      '%d%m%y',     # 240402
                      '%m%d%y',     # 042402
                     ]
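To illustrate how such a list of format strings can be used, the following sketch tries each format in turn with the standard time.strptime function. The real date standardiser is more sophisticated (see Section 6.6); this only shows the basic idea:

```python
import time

def parse_date(date_str, formats):
    # Try each format string in turn until one matches; return the
    # date as a (day, month, year) tuple, or None if nothing matches.
    for fmt in formats:
        try:
            t = time.strptime(date_str, fmt)
            return (t.tm_mday, t.tm_mon, t.tm_year)
        except ValueError:
            pass
    return None

formats = ['%d %m %Y', '%d %B %Y', '%Y%m%d']
parse_date('24 04 2002', formats)     # (24, 4, 2002)
parse_date('24 April 2002', formats)  # (24, 4, 2002)
parse_date('20020424', formats)       # (24, 4, 2002)
```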

Now the different standardisation processes can be defined. Each needs input fields and output fields. The input fields must be defined in the input data set given to the record standardiser further below, and similarly the output fields must be defined in its output data set. In this example the output data set of the record standardiser is the temporary data set defined earlier, which will later be used as input for the deduplication process.

Each of the component standardisers (for dates, names or addresses) has special arguments which are explained in detail in Chapter 6. It is possible to define more than one standardiser of each type, for example if several dates need to be cleaned and standardised, or if more than one name component or several addresses are available in a record.

# ====================================================================
# Define standardisers for dates

dob_std = DateStandardiser(name = 'DOB-std',
                    description = 'Date of birth standardiser',
                   input_fields = 'date_of_birth',
                  output_fields = ['dob_day','dob_month', 'dob_year'],
                  parse_formats = date_parse_formats)

Name cleaning and standardisation can be done using either a rule-based approach or a hidden Markov model (HMM) approach. The following two code blocks show definitions for both a rule-based and an HMM based name standardiser. For more details on HMM standardisation see Chapter 7; rule-based name standardisation is explained in Section 6.3 and the HMM based approach in Section 6.4.

# ====================================================================
# Define a standardiser for names based on rules

name_rules_std = NameRulesStandardiser(name = 'Name-Rules',
                               input_fields = ['given_name',
                                               'surname'],
                              output_fields = ['title',
                                               # ... (further output
                                               # fields not shown)
                                              ],
                             name_corr_list = name_correction_list,
                             name_tag_table = name_lookup_table,
                                male_titles = ['mr'],
                              female_titles = ['ms'],
                            field_separator = ' ',
                           check_word_spill = True)
# ====================================================================
# Define a standardiser for name based on HMM

name_hmm_std = NameHMMStandardiser(name = 'Name-HMM',
                           input_fields = ['given_name','surname'],
                          output_fields = ['title',
                                           # ... (further output
                                           # fields not shown)
                                          ],
                         name_corr_list = name_correction_list,
                         name_tag_table = name_lookup_table,
                            male_titles = ['mr'],
                          female_titles = ['ms'],
                               name_hmm = myname_hmm,
                        field_separator = ' ',
                       check_word_spill = True)

For addresses, currently only an HMM based standardisation approach is available. See Section 6.5 for more details and a description of all possible arguments for this standardiser.

# ====================================================================
# Define a standardiser for address based on HMM

address_hmm_std = AddressHMMStandardiser(name = 'Address-HMM',
                                 input_fields = ['street_num',
                                                 # ... (further input
                                                 # fields not shown)
                                                 'postcode', 'state'],
                                output_fields = ['wayfare_number',
                                                 # ... (further output
                                                 # fields not shown)
                                                ],
                            address_corr_list = addr_correction_list,
                            address_tag_table = address_lookup_table,
                              field_separator = ' ',
                                  address_hmm = myaddress_hmm)

In many data sets there are input fields that are already in a cleaned and standardised form, and which can therefore be copied directly into corresponding output fields (so they are available for a linkage or deduplication process later). The simple PassFieldStandardiser described in Section 6.8 can be used for this. It copies the values from the input fields into the corresponding output fields without any modifications. In the example given below, the values from the input fields 'rec_id' and 'soc_sec_id' are copied into output fields of the same names.

# ====================================================================

pass_fields = PassFieldStandardiser(name = 'Pass fields',
                            input_fields = ['rec_id', 'soc_sec_id'],
                           output_fields = ['rec_id', 'soc_sec_id'])

Now that the component standardisers are defined, they can be passed on to a record standardiser, which defines the input and output data sets for the data cleaning and standardisation process. A check is made to determine if the input and output fields in all the component standardisers defined above are available in the input or output data sets defined in the record standardiser.

As can be seen from the comp_stand list, not all defined component standardisers need to be passed to the record standardiser; only the ones listed are actually used for the standardisation process. It is therefore very easy to change the standardisation process by simply changing the component standardisers passed to the record standardiser.

# ====================================================================
# Define a record standardiser

comp_stand = [dob_std, name_rules_std, address_hmm_std, pass_fields]

# The HMM based name standardisation is not used in this example
# standardiser, uncomment the lines below (and comment the ones above)
# to use HMM standardisation for names.
#comp_stand = [dob_std, name_hmm_std, address_hmm_std, pass_fields]

example_standardiser = RecordStandardiser(name = 'Example-std',
                                   description = 'Exam. standardiser',
                                 input_dataset = indata,
                                output_dataset = tmpdata,
                                      comp_std = comp_stand)

In the next few code blocks definitions for the record linkage (or deduplication) process are given. First, a blocking index is defined with three different index definitions on various field combinations and encoding methods. Febrl contains various indexing methods, including standard blocking, sorted neighbourhood and an experimental bigram index. In the code shown below indexes for all three methods are defined, but only one will later be used in the definition of the deduplication process (in this example the blocking index). Note that for a linkage process the index method used must be the same for both data sets (i.e. it is not possible to use a blocking index for one data set and a sorted neighbourhood index for the other). See Section 9.1 for more details on the various indexing methods and their functionalities.

# ====================================================================
# Define blocking index(es) (one per temporary data set)

myblock_def = [[('surname','dmetaphone', 4),('dob_year','direct')],
               [('given_name','truncate', 3), ('postcode','direct')],
               # ... (third index definition not shown)
              ]

# Define one or more indexes (to be used in the classifier further
# below)

example_block_index = BlockingIndex(name = 'Index-blocking',
                                 dataset = tmpdata,
                               index_def = myblock_def)

example_sorting_index = SortingIndex(name = 'Index-sorting',
                                  dataset = tmpdata,
                                index_def = myblock_def,
                              window_size = 3)

example_bigram_index = BigramIndex(name = 'Index-bigram',
                                dataset = tmpdata,
                              index_def = myblock_def,
                              threshold = 0.75)
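To give an idea of what an index definition does, the following sketch computes the blocking key for the second index definition in myblock_def above: 'given_name' truncated to three characters, plus the 'postcode' value taken directly. This is illustrative only; Febrl builds its index keys internally and the exact key format may differ.

```python
def blocking_key(record):
    # Second index definition from 'myblock_def' above:
    # 'given_name' truncated to 3 characters, 'postcode' used directly.
    return record['given_name'][:3] + record['postcode']

rec_a = {'given_name': 'stephen', 'postcode': '2600'}
rec_b = {'given_name': 'steven',  'postcode': '2600'}

# Both records get the key 'ste2600', so they fall into the same
# block and will be compared with each other.
```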

Similar to the definition of component standardisers earlier, we now have to define field comparison functions which will be used to compare the selected fields in the cleaned data set in order to calculate the matching weight vector. Section 9.2 describes all the field comparison functions available. The following code block shows different examples of field comparator functions. Frequency tables can be used for a number of field comparison functions, as shown in the 'surname_dmetaphone' example below and described in detail in Section 9.2.1.

# ====================================================================
# Define comparison functions for linkage

given_name_nysiis = \
          FieldComparatorEncodeString(name = 'Given name NYSIIS',
                                  fields_a = 'given_name',
                                  fields_b = 'given_name',
                                    m_prob = 0.95,
                                    u_prob = 0.001,
                            missing_weight = 0.0,
                             encode_method = 'nysiis',
                                   reverse = False)
surname_dmetaphone = \
          FieldComparatorEncodeString(name = 'Surname D-Metaphone',
                                  fields_a = 'surname',
                                  fields_b = 'surname',
                                    m_prob = 0.95,
                                    u_prob = 0.001,
                            missing_weight = 0.0,
                             encode_method = 'dmetaphone',
                                   reverse = False,
                           frequency_table = surname_freq_table,
                     freq_table_max_weight = 9.9,
                     freq_table_min_weight = -4.3)

locality_name_key = \
          FieldComparatorKeyDiff(name = 'Locality name key diff',
                             fields_a = 'locality_name',
                             fields_b = 'locality_name',
                               m_prob = 0.95,
                               u_prob = 0.001,
                       missing_weight = 0.0,
                         max_key_diff = 2)

wayfare_name_winkler = \
          FieldComparatorApproxString(name = 'Wayfare name Winkler',
                                  fields_a = 'wayfare_name',
                                  fields_b = 'wayfare_name',
                                    m_prob = 0.95,
                                    u_prob = 0.001,
                            missing_weight = 0.0,
                            compare_method = 'winkler',
                          min_approx_value = 0.7)

postcode_distance = \
          FieldComparatorDistance(name = 'Postcode distance',
                              fields_a = 'postcode',
                              fields_b = 'postcode',
                                m_prob = 0.95,
                                u_prob = 0.001,
                        missing_weight = 0.0,
                         geocode_table = pc_geocode_table,
                          max_distance = 50.0)

age = FieldComparatorAge(name = 'Age',
                     fields_a = ['dob_day', 'dob_month', 'dob_year'],
                     fields_b = ['dob_day', 'dob_month', 'dob_year'],
            m_probability_day = 0.95,
            u_probability_day = 0.03333,
          m_probability_month = 0.95,
          u_probability_month = 0.083,
           m_probability_year = 0.95,
           u_probability_year = 0.01,
                max_perc_diff = 10.0,
                     fix_date = 'today')
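The m_prob and u_prob arguments above determine the agreement and disagreement weights of the classical Fellegi and Sunter model. As a sketch (base-2 logarithms are assumed here), an agreeing field contributes log2(m/u) to the weight vector and a disagreeing field log2((1-m)/(1-u)). Note how for m_prob = 0.95 and u_prob = 0.001 these come out close to the freq_table_max_weight and freq_table_min_weight values of 9.9 and -4.3 used above:

```python
import math

def agreement_weight(m_prob, u_prob):
    # Weight contributed when the compared field values agree.
    return math.log(m_prob / u_prob, 2)

def disagreement_weight(m_prob, u_prob):
    # Weight contributed when the compared field values disagree.
    return math.log((1.0 - m_prob) / (1.0 - u_prob), 2)

w_agree = agreement_weight(0.95, 0.001)        # about  9.89
w_disagree = disagreement_weight(0.95, 0.001)  # about -4.32
```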

The defined field comparators can now be passed to a record comparator as shown in the following code block. Besides a list of field comparison functions, a record comparator must be given references to the two data sets to be compared. A check will be performed to ensure that the field names given to the field comparison functions (as in the above code blocks) are available in the given data sets.

# ====================================================================
# Define a record comparator using field comparison functions

field_comparisons = [given_name_nysiis, surname_dmetaphone,
                     locality_name_key, wayfare_name_winkler,
                     postcode_distance, age]

example_comparator = RecordComparator(tmpdata, tmpdata,
                                      field_comparisons)

The last thing that needs to be defined before a linkage or deduplication can be started is a classifier, which classifies the weight vectors calculated by the field comparison functions above into links, non-links or possible links. The available classifiers are described in Section 9.5. The arguments given to the classifier are the two data sets and, in our example (a classical Fellegi and Sunter classifier), the values for the lower and upper thresholds.

# ====================================================================
# Define a classifier for classifying the matching vectors

example_fs_classifier = \
          FellegiSunterClassifier(name = 'Fellegi and Sunter',
                             dataset_a = tmpdata,
                             dataset_b = tmpdata,
                       lower_threshold = 0.0,
                       upper_threshold = 30.0)
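The effect of the two thresholds can be sketched as follows. Whether weights exactly on a threshold count as links, non-links or possible links is an assumption in this sketch; see Section 9.5 for the exact rule:

```python
def classify(total_weight, lower_threshold=0.0, upper_threshold=30.0):
    # Fellegi and Sunter style decision rule: a summed comparison
    # weight above the upper threshold gives a link, below the lower
    # threshold a non-link, and anything in between a possible link.
    if total_weight > upper_threshold:
        return 'link'
    elif total_weight < lower_threshold:
        return 'non-link'
    return 'possible link'

classify(35.0)   # 'link'
classify(-5.0)   # 'non-link'
classify(12.5)   # 'possible link'
```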

Now that all necessary components have been defined and initialised, a deduplication (or similarly a linkage) process can be started by invoking the corresponding method of the project object. The various defined components need to be given as arguments, as well as the range of records (first record and number of records) to be deduplicated, and the desired forms of output. See Section 9.6 for a detailed description of how to start a deduplication or linkage process, and Chapter 11 for descriptions of all possible output forms. A comma-separated values (CSV) output format can be chosen by setting the file extension of the 'output_rec_pair_weights' attribute to '.csv', as shown in the code below. It is also possible to save the raw comparison weight vectors into a CSV text file for further processing and classification (e.g. using advanced machine learning techniques) by setting the attribute 'weight_vector_file' to the name of a file. More details can be found in Section 9.6.

# ====================================================================
# Start a deduplication task

myproject.deduplicate(input_dataset = indata,
                        tmp_dataset = tmpdata,
                   rec_standardiser = example_standardiser,
                     rec_comparator = example_comparator,
                     blocking_index = example_block_index,
                         classifier = example_fs_classifier,
                       first_record = 0,
                     number_records = 5000,
                 weight_vector_file = 'dedup-example-weight-vecs.csv',
            weight_vector_rec_field = 'rec_id',
                   output_histogram = 'dedup-example-histogram.res',
            output_rec_pair_details = 'dedup-example-details.res',
            output_rec_pair_weights = 'dedup-example-weights.csv',
                   output_threshold = 10.0,
                  output_assignment = 'one2one')

A Febrl project module is properly ended by a call to the finalise() method of the Febrl object. This will stop the parallel environment (if Febrl has been started in parallel) and make sure everything is properly shut down.

# ====================================================================
# Finalise the Febrl project

myfebrl.finalise()

# ====================================================================

Note: If an input data set is already available in a cleaned and standardised form, the argument rec_standardiser can be set to None, in which case no cleaning and standardisation is performed. Instead, the records from the input data set are directly copied into the temporary data set. The input data set and the temporary data set must therefore have the same field name definitions.
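For such a pre-standardised data set the deduplication call might look as follows; this is a sketch only, with all other arguments as in the example above:

```python
# Records in 'indata' are assumed to be clean already, so no record
# standardiser is given and records are copied directly into the
# temporary data set (the field names of both data sets must match).
myproject.deduplicate(input_dataset = indata,
                        tmp_dataset = tmpdata,
                   rec_standardiser = None,
                     rec_comparator = example_comparator,
                     blocking_index = example_block_index,
                         classifier = example_fs_classifier,
                       first_record = 0,
                     number_records = 5000)
```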

Once you have modified a project module according to your needs (let's say you edited a file called myproject.py), you can run it by simply typing

  python myproject.py

into your terminal (assuming you are in the correct directory), and Febrl should run according to your settings.