All aspects of the Febrl system are configured and
controlled by a single Python module (program) that can be derived
from either the project-standardise.py,
project-linkage.py or project-deduplicate.py modules
provided. In this section the complete project-deduplicate.py
module as supplied with the current Febrl distribution is
described in detail using blocks of code extracted from this module.
The module project-linkage.py is very similar in structure, the
main difference being that it deals with two data sets, while the
module project-standardise.py only contains the definitions for one
input and one output data set plus the necessary standardisers, and no
linkage or deduplication process is defined.
Each code block is explained and references to the relevant chapters
are given. It is assumed that the reader has some familiarity with the
(very simple) syntax of the Python programming language in which
Febrl is implemented. If not, the necessary knowledge can be
gained in just a few hours from one of the tutorials or introductions
listed on the Python Web site at http://www.python.org. Note
that comments in the Python language start with a hash character (#)
and continue until the end of a line. Alternatively, text spanning
several lines can be included by enclosing it in triple quotes ("""),
as is done in the code below for the module documentation string.
At the top of the project-deduplicate.py module is the header with version and licensing information, followed by a documentation string that gives the name of the module and a short description.
# ====================================================================
# AUSTRALIAN NATIONAL UNIVERSITY OPEN SOURCE LICENSE (ANUOS LICENSE)
# VERSION 1.2
#
# The contents of this file are subject to the ANUOS License Version
# 1.2 (the "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at:
#
#   http://datamining.anu.edu.au/linkage.html
#
# Software distributed under the License is distributed on an "AS IS"
# basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
# the License for the specific language governing rights and
# limitations under the License.
#
# The Original Software is: "project-deduplicate.py"
#
# The Initial Developers of the Original Software are:
#   Dr Tim Churches (Centre for Epidemiology and Research, New South
#     Wales Department of Health)
#   Dr Peter Christen (Department of Computer Science, Australian
#     National University)
#
# Copyright (C) 2002 - 2005 the Australian National University and
# others. All Rights Reserved.
#
# Contributors:
#
# Alternatively, the contents of this file may be used under the terms
# of the GNU General Public License Version 2 or later (the "GPL"), in
# which case the provisions of the GPL are applicable instead of those
# above. The GPL is available at the following URL: http://www.gnu.org
# If you wish to allow use of your version of this file only under the
# terms of the GPL, and not to allow others to use your version of
# this file under the terms of the ANUOS License, indicate your
# decision by deleting the provisions above and replace them with the
# notice and other provisions required by the GPL. If you do not
# delete the provisions above, a recipient may use your version of
# this file under the terms of any one of the ANUOS License or the GPL.
# ====================================================================
#
# Freely extensible biomedical record linkage (Febrl) - Version 0.3
#
# See: http://datamining.anu.edu.au/linkage.html
#
# ====================================================================

"""Module project-deduplicate.py - Configuration for a deduplication project

   Briefly, what needs to be defined for a deduplication project is:
   - A Febrl object, a project, plus a project logger
   - One input data set
   - One corresponding temporary data set (with 'readwrite' access)
   - Lookup tables to be used
   - Standardisers for names, addresses and dates
   - Field comparator functions and a record comparator
   - A blocking index
   - A classifier

   and then the 'deduplicate' method can be called.

   This project module will standardise and then deduplicate the
   example data set 'dataset2.csv' given in the 'dsgen' directory.

   The directory separator 'dirsep' is a shorthand to os.sep as
   defined in febrl.py.
"""
Note that different operating systems use different directory
separators: Unix/Linux based systems use '/', while Windows/MSDOS
uses '\'. From version 0.3 onwards Febrl contains a specifier
'dirsep' (a shorthand for os.sep, as defined in febrl.py) which is
set to the operating system's directory separator and imported when
the febrl module is imported. The 'dirsep' is used in the following
code blocks each time a file in a sub-directory of the main Febrl
directory is accessed.
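For illustration only (this snippet is not part of the project
module), the following sketch shows what 'dirsep' amounts to,
assuming it is set to os.sep as stated in the module documentation
string above:

import os

dirsep = os.sep   # 'dirsep' is defined like this in febrl.py

# Evaluates to 'data/surname.tbl' on Unix/Linux and to
# 'data\surname.tbl' on Windows/MSDOS
table_file_name = 'data' + dirsep + 'surname.tbl'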
In the following code block all the required Febrl and standard
Python modules are imported so the necessary functionalities are
available.
# ====================================================================
# Imports go here

import sys    # Python system modules needed
import time

from febrl import *            # Main Febrl classes
from dataset import *          # Data set routines
from standardisation import *  # Standardisation routines
from comparison import *       # Comparison functions
from lookup import *           # Look-up table routines
from indexing import *         # Indexing and blocking routines
from simplehmm import *        # Hidden Markov model (HMM) routines
from classification import *   # Classifiers for weight vectors
The first thing that should be done is to initialise a project logger,
which defines how logging information is displayed in the terminal
window (or console) and in a log file. The console and file levels
can be set independently from each other to the values 'NOTSET',
'DEBUG', 'INFO', 'WARNING', 'ERROR' or 'CRITICAL' (see the Python
standard module logging.py for more details about the Python logging
system). Similar to the parallel_write argument discussed below,
parallel_output (which can be set to 'host' or 'all') defines the way
output is logged and displayed. For more details on console output
and logging please see Chapter 15.
# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name   = 'febrl-example-dedup.log',
                  file_level      = 'WARN',
                  console_level   = 'INFO',
                  clear_log       = True,
                  parallel_output = 'host')
Next the system is initialised by creating a Febrl object myfebrl,
and a new project is initialised as part of it. This project is
assigned a name, a description and a file name (into which it can be
saved at any time). Febrl loads records block-wise from the input
data sets, and the argument block_size controls how many records
should be loaded into one such block.

Note that you can use names other than myfebrl and myproject if you
wish - just be sure to use the same name throughout. In fact, you can
configure multiple Febrl projects from one project.py module if you
wish, using different variable names to refer to each project object.
For the sake of simplicity, we will only configure one project here.
# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

myfebrl = Febrl(description = 'Example Febrl instance',
                febrl_path = '.')

myproject = myfebrl.new_project(name = 'example-dedup',
                                description = 'Deduplicate example data set',
                                file_name = 'example-deduplicate.fbr',
                                block_size = 100,
                                parallel_write = 'host')
The argument block_size sets the number of records that will be
loaded and processed in one block, because Febrl works in a
block-wise fashion. Note that this block size has nothing to do with
the blocking used within the record linkage process. Rather, records
are loaded from files in chunks (blocks) and processed, before the
next block is loaded and processed.

The parallel_write argument sets the way in which data sets are
written to files when Febrl is run in parallel. If set to 'host',
only the host process (the one Febrl has been started on) writes
into data sets. On the other hand, if set to 'all', then all
processes write into local files. See Chapter 13 for more detailed
information on this topic.
The next two code blocks define an input data set (initialised for reading from a CSV file) and a temporary memory based data set (to hold the cleaned and standardised records before they are deduplicated or linked and written into an output data set). See Chapter 13 for more information on the given arguments and on how to access data sets. Note that this initialisation of data sets does not mean that they are loaded immediately, but only that preparations are made to access them. Note also that the fields defined in the input data set do not need to cover all the columns in the data file or database table being accessed (for example, if some columns are not needed in the standardisation and linkage process).
# ====================================================================
# Define your original input data set(s)
# Only one data set is needed for deduplication

indata = DataSetCSV(name = 'example2in',
                    description = 'Example data set 2',
                    access_mode = 'read',
                    header_lines = 1,
                    file_name = 'dsgen'+dirsep+'dataset2.csv',
                    fields = {'rec_id':0,
                              'given_name':1,
                              'surname':2,
                              'street_num':3,
                              'address_part_1':4,
                              'address_part_2':5,
                              'suburb':6,
                              'postcode':7,
                              'state':8,
                              'date_of_birth':9,
                              'soc_sec_id':10},
                    fields_default = '',
                    strip_fields = True,
                    missing_values = ['','missing'])
# ====================================================================
# Define a temporary data set

tmpdata = DataSetMemory(name = 'example2tmp',
                        description = 'Temporary example 2 data set',
                        access_mode = 'readwrite',
                        fields = {'title':1,
                                  'gender_guess':2,
                                  'given_name':3,
                                  'alt_given_name':4,
                                  'surname':5,
                                  'alt_surname':6,
                                  'wayfare_number':7,
                                  'wayfare_name':8,
                                  'wayfare_qualifier':9,
                                  'wayfare_type':10,
                                  'unit_number':11,
                                  'unit_type':12,
                                  'property_name':13,
                                  'institution_name':14,
                                  'institution_type':15,
                                  'postaddress_number':16,
                                  'postaddress_type':17,
                                  'locality_name':18,
                                  'locality_qualifier':19,
                                  'postcode':20,
                                  'territory':21,
                                  'country':22,
                                  'dob_day':23,
                                  'dob_month':24,
                                  'dob_year':25,

                                  # The following are fields that are
                                  # passed without standardisation
                                  'rec_id':0,
                                  'soc_sec_id':26,

                                  # The last output field contains the
                                  # probability of the address HMM
                                  'address_hmm_prob':27,
                                 },
                        missing_values = ['','missing'])
Various types of look-up tables are loaded from their files in the next code block. Note that, for example, several name tagging look-up table files are loaded into one name tagging look-up table. More details on the file format of these look-up tables and correction lists can be found in Chapter 14.
# ====================================================================
# Define and load lookup tables

name_lookup_table = TagLookupTable(name = 'Name lookup table',
                                   default = '')
name_lookup_table.load(['data'+dirsep+'givenname_f.tbl',
                        'data'+dirsep+'givenname_m.tbl',
                        'data'+dirsep+'name_prefix.tbl',
                        'data'+dirsep+'name_misc.tbl',
                        'data'+dirsep+'saints.tbl',
                        'data'+dirsep+'surname.tbl',
                        'data'+dirsep+'title.tbl'])

name_correction_list = CorrectionList(name = 'Name correction list')
name_correction_list.load('data'+dirsep+'name_corr.lst')

surname_freq_table = FrequencyLookupTable(name = 'Surname freq table',
                                          default = 1)
surname_freq_table.load('data'+dirsep+'surname_nsw_freq.csv')

address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                      default = '')
address_lookup_table.load(['data'+dirsep+'country.tbl',
                           'data'+dirsep+'address_misc.tbl',
                           'data'+dirsep+'address_qual.tbl',
                           'data'+dirsep+'institution_type.tbl',
                           'data'+dirsep+'locality_name_act.tbl',
                           'data'+dirsep+'locality_name_nsw.tbl',
                           'data'+dirsep+'post_address.tbl',
                           'data'+dirsep+'postcode_act.tbl',
                           'data'+dirsep+'postcode_nsw.tbl',
                           'data'+dirsep+'saints.tbl',
                           'data'+dirsep+'territory.tbl',
                           'data'+dirsep+'unit_type.tbl',
                           'data'+dirsep+'wayfare_type.tbl'])

addr_correction_list = CorrectionList(name = 'Address corr. list')
addr_correction_list.load('data'+dirsep+'address_corr.lst')

pc_geocode_table = GeocodeLookupTable(name = 'NSW postcode locations',
                                      default = [])
pc_geocode_table.load('data'+dirsep+'postcode_centroids.csv')
In the following code block two hidden Markov models (HMMs) are defined (first their states and then the observations, i.e. tags) and loaded from files. These HMMs are used by the name and address component standardisation processes. Further information on HMMs is given in Chapters 7 and 8.
# ====================================================================
# Define and load hidden Markov models (HMMs)

name_states = ['titl','baby','knwn','andor','gname1','gname2','ghyph',
               'gopbr','gclbr','agname1','agname2','coma','sname1',
               'sname2','shyph','sopbr','sclbr','asname1','asname2',
               'pref1','pref2','rubb']
name_tags = ['NU','AN','TI','PR','GF','GM','SN','ST','SP','HY','CO',
             'NE','II','BO','VB','UN','RU']

myname_hmm = hmm('Name HMM', name_states, name_tags)
myname_hmm.load_hmm('hmm'+dirsep+'name-absdiscount.hmm')

address_states = ['wfnu','wfna1','wfna2','wfql','wfty','unnu','unty',
                  'prna1','prna2','inna1','inna2','inty','panu',
                  'paty','hyph','sla','coma','opbr','clbr','loc1',
                  'loc2','locql','pc','ter1','ter2','cntr1','cntr2',
                  'rubb']
address_tags = ['PC','N4','NU','AN','TR','CR','LN','ST','IN','IT',
                'LQ','WT','WN','UT','HY','SL','CO','VB','PA','UN',
                'RU']

myaddress_hmm = hmm('Address HMM', address_states, address_tags)
myaddress_hmm.load_hmm('hmm'+dirsep+'address-absdiscount.hmm')
Next follows a list of possible date formats needed to parse date strings in the data standardisation process. See Section 6.6 for more details on these format strings.
# ====================================================================
# Define a list of date parsing format strings

date_parse_formats = ['%d %m %Y',   # 24 04 2002 or 24 4 2002
                      '%d %B %Y',   # 24 Apr 2002 or 24 April 2002
                      '%m %d %Y',   # 04 24 2002 or 4 24 2002
                      '%B %d %Y',   # Apr 24 2002 or April 24 2002
                      '%Y %m %d',   # 2002 04 24 or 2002 4 24
                      '%Y %B %d',   # 2002 Apr 24 or 2002 April 24
                      '%Y%m%d',     # 20020424   ** ISO standard **
                      '%d%m%Y',     # 24042002
                      '%m%d%Y',     # 04242002
                      '%d %m %y',   # 24 04 02 or 24 4 02
                      '%d %B %y',   # 24 Apr 02 or 24 April 02
                      '%y %m %d',   # 02 04 24 or 02 4 24
                      '%y %B %d',   # 02 Apr 24 or 02 April 24
                      '%m %d %y',   # 04 24 02 or 4 24 02
                      '%B %d %y',   # Apr 24 02 or April 24 02
                      '%y%m%d',     # 020424
                      '%d%m%y',     # 240402
                      '%m%d%y',     # 042402
                     ]
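The format directives used above ('%d', '%m', '%B', '%Y', '%y') are
the same as those of the standard Python time module. As a quick
illustration (not part of the project module, and independent of
Febrl's own date parsing routines), the standard library can be used
to check what a given format string matches:

import time

# '24 04 2002' matches the format '%d %m %Y'; the call returns a time
# tuple with year 2002, month 4 and day 24, while a string that does
# not match the format would raise a ValueError
parsed_date = time.strptime('24 04 2002', '%d %m %Y')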
Now the different standardisation processes can be defined. Each needs input fields and output fields. It is assumed that these input fields are defined in the input data set as defined in the record standardiser further below. Similarly, output fields need to be defined in the output data set of the record standardiser. In this example the output data set of the record standardiser is the temporary data set defined earlier. This temporary data set will later be used as input for the deduplication process.
Each of the component standardisers (for dates, names or addresses) has special arguments which are explained in detail in Chapter 6. It is possible to define more than one standardiser, for example if several dates need to be cleaned and standardised, or if more than one name component or several addresses are available in a record.
# ====================================================================
# Define standardisers for dates

dob_std = DateStandardiser(name = 'DOB-std',
                           description = 'Date of birth standardiser',
                           input_fields = 'date_of_birth',
                           output_fields = ['dob_day','dob_month',
                                            'dob_year'],
                           parse_formats = date_parse_formats)
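If more than one date needs to be standardised, additional date
standardisers can be defined in the same way. The following sketch
assumes a hypothetical input field 'date_of_death' and corresponding
'dod_*' output fields, none of which are defined in the example data
sets:

# A sketch only - the input field 'date_of_death' and the output
# fields 'dod_day', 'dod_month' and 'dod_year' are hypothetical and
# would have to be defined in the input and temporary data sets
dod_std = DateStandardiser(name = 'DOD-std',
                           description = 'Date of death standardiser',
                           input_fields = 'date_of_death',
                           output_fields = ['dod_day','dod_month',
                                            'dod_year'],
                           parse_formats = date_parse_formats)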
Name cleaning and standardisation can be done using a rule-based approach or a hidden Markov model (HMM) approach. The following two code blocks show definitions for both a rule-based and an HMM based name standardiser. For more details on HMM standardisation see Chapter 7; the rule-based name standardisation is explained in Section 6.3 and the HMM based approach in Section 6.4.
# ====================================================================
# Define a standardiser for names based on rules

name_rules_std = NameRulesStandardiser(name = 'Name-Rules',
                                       input_fields = ['given_name',
                                                       'surname'],
                                       output_fields = ['title',
                                                        'gender_guess',
                                                        'given_name',
                                                        'alt_given_name',
                                                        'surname',
                                                        'alt_surname'],
                                       name_corr_list = name_correction_list,
                                       name_tag_table = name_lookup_table,
                                       male_titles = ['mr'],
                                       female_titles = ['ms'],
                                       field_separator = ' ',
                                       check_word_spill = True)
# ====================================================================
# Define a standardiser for names based on HMM

name_hmm_std = NameHMMStandardiser(name = 'Name-HMM',
                                   input_fields = ['given_name','surname'],
                                   output_fields = ['title',
                                                    'gender_guess',
                                                    'given_name',
                                                    'alt_given_name',
                                                    'surname',
                                                    'alt_surname'],
                                   name_corr_list = name_correction_list,
                                   name_tag_table = name_lookup_table,
                                   male_titles = ['mr'],
                                   female_titles = ['ms'],
                                   name_hmm = myname_hmm,
                                   field_separator = ' ',
                                   check_word_spill = True)
For addresses, currently only an HMM based standardisation approach is available. See Section 6.5 for more details and a description of all the possible arguments for this standardiser.
# ====================================================================
# Define a standardiser for addresses based on HMM

address_hmm_std = AddressHMMStandardiser(name = 'Address-HMM',
                                         input_fields = ['street_num',
                                                         'address_part_1',
                                                         'address_part_2',
                                                         'suburb',
                                                         'postcode',
                                                         'state'],
                                         output_fields = ['wayfare_number',
                                                          'wayfare_name',
                                                          'wayfare_qualifier',
                                                          'wayfare_type',
                                                          'unit_number',
                                                          'unit_type',
                                                          'property_name',
                                                          'institution_name',
                                                          'institution_type',
                                                          'postaddress_number',
                                                          'postaddress_type',
                                                          'locality_name',
                                                          'locality_qualifier',
                                                          'postcode',
                                                          'territory',
                                                          'country',
                                                          'address_hmm_prob'],
                                         address_corr_list = addr_correction_list,
                                         address_tag_table = address_lookup_table,
                                         field_separator = ' ',
                                         address_hmm = myaddress_hmm)
In many data sets there are input fields that are already in a cleaned
and standardised form, and which therefore can directly be copied into
corresponding output fields (so they are available for a linkage or
deduplication process later). The simple PassFieldStandardiser as
described in Section 6.8 can be used for this. It copies the values
from the input fields into the corresponding output fields without any
modifications. In the example given below, values from the input field
'rec_id' are copied into the output field of the same name, and values
from the input field 'soc_sec_id' are copied into an output field of
the same name.
# ====================================================================

pass_fields = PassFieldStandardiser(name = 'Pass fields',
                                    input_fields = ['rec_id',
                                                    'soc_sec_id'],
                                    output_fields = ['rec_id',
                                                     'soc_sec_id'])
Now that the component standardisers are defined, they can be passed on to a record standardiser, which defines the input and output data sets for the data cleaning and standardisation process. A check is made to determine if the input and output fields in all the component standardisers defined above are available in the input or output data sets defined in the record standardiser.
As can be seen in the comp_stand list, not all defined component
standardisers need to be passed to the record standardiser. Only the
ones listed here are actually used for the standardisation process.
It is very easy to change the standardisation process by simply
changing the component standardisers passed to the record
standardiser.
# ====================================================================
# Define a record standardiser

comp_stand = [dob_std, name_rules_std, address_hmm_std, pass_fields]

# The HMM based name standardisation is not used in this example
# standardiser; uncomment the line below (and comment out the one
# above) to use HMM standardisation for names.
#
# comp_stand = [dob_std, name_hmm_std, address_hmm_std, pass_fields]

example_standardiser = RecordStandardiser(name = 'Example-std',
                                          description = 'Exam. standardiser',
                                          input_dataset = indata,
                                          output_dataset = tmpdata,
                                          comp_std = comp_stand)
In the next few code blocks definitions for the record linkage (or deduplication) process are given. First, a blocking index is defined with three different indexes on various field combinations and encoding methods. Febrl contains various indexing methods, including standard blocking, the sorted neighbourhood approach and an experimental bigram index. In the code shown below indexes for all three methods are defined, but only one will later be used in the definition of the deduplication process (in this example the blocking index). Note that for a linkage process the index method used must be the same for both data sets (i.e. it is not possible to use a blocking index for one data set and a sorted neighbourhood index for the other). See Section 9.1 for more details on the various indexing methods and their functionalities.
# ====================================================================
# Define blocking index(es) (one per temporary data set)

myblock_def = [[('surname','dmetaphone', 4), ('dob_year','direct')],
               [('given_name','truncate', 3), ('postcode','direct')],
               [('dob_month','direct'), ('locality_name','nysiis')],
              ]

# Define one or more indexes (to be used in the classifier further
# below)

example_block_index = BlockingIndex(name = 'Index-blocking',
                                    dataset = tmpdata,
                                    index_def = myblock_def)

example_sorting_index = SortingIndex(name = 'Index-sorting',
                                     dataset = tmpdata,
                                     index_def = myblock_def,
                                     window_size = 3)

example_bigram_index = BigramIndex(name = 'Index-bigram',
                                   dataset = tmpdata,
                                   index_def = myblock_def,
                                   threshold = 0.75)
Similar to the definition of component standardisers earlier, we now
have to define field comparison functions which will be used to
compare the selected fields in the cleaned data set in order to
calculate the matching weight vector. Section 9.2 describes all the
field comparison functions available. The following code blocks show
different examples of field comparator functions. Frequency tables
can be used for a number of field comparison functions, as shown in
the 'surname_dmetaphone' example below and described in detail in
Section 9.2.1.
# ====================================================================
# Define comparison functions for linkage

given_name_nysiis = \
    FieldComparatorEncodeString(name = 'Given name NYSIIS',
                                fields_a = 'given_name',
                                fields_b = 'given_name',
                                m_prob = 0.95,
                                u_prob = 0.001,
                                missing_weight = 0.0,
                                encode_method = 'nysiis',
                                reverse = False)
surname_dmetaphone = \
    FieldComparatorEncodeString(name = 'Surname D-Metaphone',
                                fields_a = 'surname',
                                fields_b = 'surname',
                                m_prob = 0.95,
                                u_prob = 0.001,
                                missing_weight = 0.0,
                                encode_method = 'dmetaphone',
                                reverse = False,
                                frequency_table = surname_freq_table,
                                freq_table_max_weight = 9.9,
                                freq_table_min_weight = -4.3)

locality_name_key = \
    FieldComparatorKeyDiff(name = 'Locality name key diff',
                           fields_a = 'locality_name',
                           fields_b = 'locality_name',
                           m_prob = 0.95,
                           u_prob = 0.001,
                           missing_weight = 0.0,
                           max_key_diff = 2)

wayfare_name_winkler = \
    FieldComparatorApproxString(name = 'Wayfare name Winkler',
                                fields_a = 'wayfare_name',
                                fields_b = 'wayfare_name',
                                m_prob = 0.95,
                                u_prob = 0.001,
                                missing_weight = 0.0,
                                compare_method = 'winkler',
                                min_approx_value = 0.7)

postcode_distance = \
    FieldComparatorDistance(name = 'Postcode distance',
                            fields_a = 'postcode',
                            fields_b = 'postcode',
                            m_prob = 0.95,
                            u_prob = 0.001,
                            missing_weight = 0.0,
                            geocode_table = pc_geocode_table,
                            max_distance = 50.0)

age = FieldComparatorAge(name = 'Age',
                         fields_a = ['dob_day', 'dob_month', 'dob_year'],
                         fields_b = ['dob_day', 'dob_month', 'dob_year'],
                         m_probability_day = 0.95,
                         u_probability_day = 0.03333,
                         m_probability_month = 0.95,
                         u_probability_month = 0.083,
                         m_probability_year = 0.95,
                         u_probability_year = 0.01,
                         max_perc_diff = 10.0,
                         fix_date = 'today')
The defined field comparators can now be passed to a record comparator as shown in the following code block. Besides a list of field comparison functions, a record comparator must be given references to the two data sets to be compared. A check will be performed to ensure that the field names given to the field comparison functions (as in the above code blocks) are available in the given data sets.
# ====================================================================
# Define a record comparator using field comparison functions

field_comparisons = [given_name_nysiis, surname_dmetaphone,
                     locality_name_key, wayfare_name_winkler,
                     postcode_distance, age]

example_comparator = RecordComparator(tmpdata, tmpdata,
                                      field_comparisons)
The last thing that needs to be defined before a linkage or deduplication can be started is a classifier, which classifies the vectors of weights calculated by the field comparison functions above into links, non-links or possible links. The available classifiers are described in Section 9.5. The arguments given to the classifier are the two data sets and, for the classical Fellegi and Sunter classifier used in our example, the values for the lower and upper thresholds.
# ====================================================================
# Define a classifier for classifying the matching vectors

example_fs_classifier = \
    FellegiSunterClassifier(name = 'Fellegi and Sunter',
                            dataset_a = tmpdata,
                            dataset_b = tmpdata,
                            lower_threshold = 0.0,
                            upper_threshold = 30.0)
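To illustrate the role of the two thresholds, the following sketch
(plain Python, not Febrl code) shows the classical Fellegi and Sunter
decision rule, under the assumption that the summed field weights of
a compared record pair are tested against the lower and upper
thresholds:

def fellegi_sunter_decision(weight_vector, lower = 0.0, upper = 30.0):
  # Sketch only: sum the field comparison weights of a record pair and
  # compare the total with the two thresholds
  total_weight = sum(weight_vector)

  if total_weight > upper:
    return 'link'
  elif total_weight < lower:
    return 'non-link'
  else:
    return 'possible link'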
Now that all necessary components have been defined and initialised, a
deduplication (or similarly a linkage) process can be started easily
by invoking the corresponding method of the project object. The
various defined components need to be given as arguments, as well as
the range of records (first record and number of records) to be
deduplicated, and the desired form of outputs. See Section 9.6 for a
detailed description of how to start a deduplication or linkage
process, and Chapter 11 for descriptions of all possible output forms
(a new output format, comma-separated values (CSV), can be chosen by
setting the file extension of the attribute
'output_rec_pair_weights' to '.csv', as shown in the code below). It
is also possible to save the raw comparison weight vectors into a CSV
text file for further processing and classification (e.g. using
advanced machine learning techniques), by setting the attribute
'weight_vector_file' to the name of a file. More details about this
can be found in Section 9.6.
# ====================================================================
# Start a deduplication task

myproject.deduplicate(input_dataset = indata,
                      tmp_dataset = tmpdata,
                      rec_standardiser = example_standardiser,
                      rec_comparator = example_comparator,
                      blocking_index = example_block_index,
                      classifier = example_fs_classifier,
                      first_record = 0,
                      number_records = 5000,
                      weight_vector_file = 'dedup-example-weight-vecs.csv',
                      weight_vector_rec_field = 'rec_id',
                      output_histogram = 'dedup-example-histogram.res',
                      output_rec_pair_details = 'dedup-example-details.res',
                      output_rec_pair_weights = 'dedup-example-weights.csv',
                      output_threshold = 10.0,
                      output_assignment = 'one2one')

# ====================================================================

myfebrl.finalise()

# ====================================================================
A Febrl project module is properly ended by a call to the finalise()
method of the Febrl object (myfebrl in our example). This will stop
the parallel environment (if Febrl has been started in parallel) and
make sure everything is properly shut down.

Note that the argument rec_standardiser of the deduplicate() method
can be set to None, in which case no cleaning and standardisation is
performed. Instead, the records from the input data set are directly
copied into the temporary data set. Therefore, the input data set and
the temporary data set must have the same field name definitions.
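For example, the deduplication task shown above could be started on
the unstandardised input records roughly as follows (a sketch only;
it assumes that indata and tmpdata have been defined with identical
field name definitions, which is not the case in this example
module):

myproject.deduplicate(input_dataset = indata,
                      tmp_dataset = tmpdata,
                      rec_standardiser = None,  # records are copied directly
                      rec_comparator = example_comparator,
                      blocking_index = example_block_index,
                      classifier = example_fs_classifier,
                      first_record = 0,
                      number_records = 5000,
                      weight_vector_file = 'dedup-example-weight-vecs.csv',
                      weight_vector_rec_field = 'rec_id',
                      output_histogram = 'dedup-example-histogram.res',
                      output_rec_pair_details = 'dedup-example-details.res',
                      output_rec_pair_weights = 'dedup-example-weights.csv',
                      output_threshold = 10.0,
                      output_assignment = 'one2one')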
Once you have modified a project module according to your needs (let's
say you edited a file called myproject.py
), you can run it by
simply typing
python myproject.py