All aspects of the Febrl system are configured and
controlled by a single Python module (program) that can be derived
from one of the provided modules project-standardise.py,
project-linkage.py or project-deduplicate.py.
In this section the complete project-deduplicate.py
module, as supplied with the current Febrl distribution, is
described in detail using blocks of code extracted from it.
The module project-linkage.py is very similar in its
structure, the main difference being that two data sets are dealt with,
while the module project-standardise.py only contains the
definitions for one input and one output data set plus the necessary
standardisers, but no linkage or deduplication process.
Each code block is explained and references to the relevant chapters
are given. It is assumed that the reader has some familiarity with the
(very simple) syntax of the Python programming language in which
Febrl is implemented. If not, the necessary knowledge can be
gained in just a few hours from one of the tutorials or introductions
listed on the Python Web site at http://www.python.org. Note
that comments in the Python language start with a hash character (#)
and continue until the end of the line. Alternatively, a comment can
span several lines if it is enclosed in triple quotes (""") as shown
in the code below for the module's documentation string.
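For readers who have never seen Python source code, the following
minimal illustration shows both comment forms (it is not part of the
Febrl modules):

# This is a single-line comment; it extends to the end of the line.

"""This is a documentation string enclosed in triple quotes; it can
   span several lines, as used at the top of the project modules.
"""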
At the top of the project-deduplicate.py module is the header with version and licensing information, followed by a documentation string that gives the name of the module and a short description.
# ====================================================================
# project-deduplicate.py - Configuration for a deduplication project.
#
# Freely extensible biomedical record linkage (Febrl) Version 0.2.2
# See http://datamining.anu.edu.au/projects/linkage.html
#
# ====================================================================
# AUSTRALIAN NATIONAL UNIVERSITY OPEN SOURCE LICENSE (ANUOS LICENSE)
# VERSION 1.1
#
# The contents of this file are subject to the ANUOS License Version
# 1.1 (the "License"); you may not use this file except in compliance
# with the License. Software distributed under the License is
# distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND,
# either express or implied. See the License for the specific language
# governing rights and limitations under the License.
# The Original Software is "project-deduplicate.py".
# The Initial Developers of the Original Software are Dr Peter
# Christen (Department of Computer Science, Australian National
# University) and Dr Tim Churches (Centre for Epidemiology and
# Research, New South Wales Department of Health). Copyright (C) 2002,
# 2003 the Australian National University and others. All Rights
# Reserved.
# Contributors:
#
# ====================================================================

"""Module project-deduplicate.py - Configuration for a deduplication
   project.

   Briefly, what needs to be defined for a deduplication project is:

   - A Febrl object, a project, plus a project logger
   - One input data set
   - One corresponding temporary data set (with 'readwrite' access)
   - Lookup tables to be used
   - Standardisers for names, addresses and dates
   - Field comparator functions and a record comparator
   - A blocking index
   - A classifier

   and then the 'deduplicate' method can be called.

   This project module will standardise and then deduplicate the
   example data set 'dataset2.csv' given in the 'dbgen' directory.
"""
In the following code block all the required Febrl modules are imported so the necessary functionality is available.
# ====================================================================
# Imports go here

import sys   # Python system modules needed
import time

from febrl import *            # Main Febrl classes
from dataset import *          # Data set routines
from standardisation import *  # Standardisation routines
from comparison import *       # Comparison functions
from lookup import *           # Look-up table routines
from indexing import *         # Indexing and blocking routines
from simplehmm import *        # Hidden Markov model (HMM) routines
from classification import *   # Classifiers for weight vectors
Next the system is initialised by creating a Febrl object
myfebrl, and a new project is initialised as part of it. This
project is assigned a name, a description and a file name (into
which it can be saved at any time). Febrl loads records
block-wise from the input data sets, and the argument block_size
controls how many records are loaded into one such block.
Note that you can use names other than myfebrl and myproject
if you wish - just be sure to use the same names
throughout. In fact, you can configure multiple Febrl
projects from one project module if you wish, using
different variable names to refer to each project object. For the
sake of simplicity, we will only configure one project here.
# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

myfebrl = Febrl(description = 'Example Febrl instance',
                 febrl_path = '.')

myproject = myfebrl.new_project(name = 'example-dedup',
                         description = 'Deduplicate example data set',
                           file_name = 'example-deduplicate.fbr',
                          block_size = 1000,
                      parallel_write = 'host')
The argument block_size sets the number of records that will
be loaded and processed in one block, because Febrl processes
data sets in a blocked fashion. Note that this block size has nothing
to do with the blocking used within the record linkage process. Rather,
records are loaded from files in chunks (blocks) and processed before
the next block is loaded and processed.
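The following is only a conceptual sketch of what such blocked
processing means; it is not Febrl code, and the function name and its
output are made up for illustration:

# Conceptual sketch only - not part of Febrl. Records are handled in
# chunks of 'block_size', so only one chunk is held in memory at a time.
def process_in_blocks(num_records, block_size = 1000):
  start = 0
  while start < num_records:
    end = min(start + block_size, num_records)
    # here the records [start, end) would be loaded, standardised and
    # compared before the next block is loaded
    print('Processing records %d to %d' % (start, end - 1))
    start = end

process_in_blocks(5000, 1000)   # five blocks of 1000 records each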
The parallel_write argument sets the way in which data sets
are written to files when Febrl is run in parallel. If set
to 'host', only the host process (the one Febrl has
been started on) writes into data sets. On the other hand, if set to
'all', then all processes write into local files. See
Chapter 12 for more detailed information on this topic.
To enable verbose output and logging of status, warning and error
messages into a log file, a project logger is defined next. The
verbose and log output levels can be set independently of each other
to the values 0 (no output), 1 (only summary output),
2 (extended summary output) or 3 (very detailed output
at record level). Similar to the parallel_write argument
discussed above, parallel_print (which can be set to
'host' or 'all') defines the way output is logged and
printed. For more details on verbose output and logging please see
Chapter 14.
# ====================================================================
# Define a project logger

mylog = ProjectLog(file_name = 'example-dedup.log',
                     project = myproject,
                   log_level = 1,
               verbose_level = 2,
                   clear_log = True,
                     no_warn = False,
              parallel_print = 'host')
The next two code blocks define an input data set (initialised for reading from a CSV file) and a temporary memory based data set (to hold the cleaned and standardised records before they are deduplicated or linked and written into an output data set). See Chapter 12 for more information on the given arguments, and how to access data sets. Note that this initialisation of data sets does not mean that they are immediately loaded, but only that preparations are made to access them. Note also that the fields defined in the input data set do not need to cover all the columns in the data file or database table being accessed (for example, if some columns are not needed in a standardisation and linkage process).
# ====================================================================
# Define your original input data set(s)
# Only one data set is needed for deduplication

indata = DataSetCSV(name = 'example2in',
             description = 'Example data set 2',
             access_mode = 'read',
            header_lines = 1,
               file_name = './dbgen/dataset2.csv',
                  fields = {'rec_id':0,
                            'given_name':1,
                            'surname':2,
                            'street_num':3,
                            'address_part_1':4,
                            'address_part_2':5,
                            'suburb':6,
                            'postcode':7,
                            'state':8,
                            'date_of_birth':9,
                            'soc_sec_id':10},
          fields_default = '',
            strip_fields = True,
          missing_values = ['','missing'])
# ====================================================================
# Define a temporary data set

tmpdata = DataSetMemory(name = 'example2tmp',
                 description = 'Temporary example 2 data set',
                 access_mode = 'readwrite',
                      fields = {'title':1,
                                'gender_guess':2,
                                'given_name':3,
                                'alt_given_name':4,
                                'surname':5,
                                'alt_surname':6,
                                'wayfare_number':7,
                                'wayfare_name':8,
                                'wayfare_qualifier':9,
                                'wayfare_type':10,
                                'unit_number':11,
                                'unit_type':12,
                                'property_name':13,
                                'institution_name':14,
                                'institution_type':15,
                                'postaddress_number':16,
                                'postaddress_type':17,
                                'locality_name':18,
                                'locality_qualifier':19,
                                'postcode':20,
                                'territory':21,
                                'country':22,
                                'dob_day':23,
                                'dob_month':24,
                                'dob_year':25,
                # The following are fields that are passed without
                # standardisation
                                'rec_id':0,
                                'soc_sec_id':26,
                # The last output field contains the probability of
                # the address HMM
                                'address_hmm_prob':27,
                               },
              missing_values = ['','missing'])
Various types of look-up tables are loaded from their files in the next code block. Note that, for example, several name tagging look-up table files are loaded into one name tagging look-up table. More details on the file format of these look-up tables and correction lists can be found in Chapter 13.
# ====================================================================
# Define and load lookup tables

name_lookup_table = TagLookupTable(name = 'Name lookup table',
                                default = '')
name_lookup_table.load(['./data/givenname_f.tbl',
                        './data/givenname_m.tbl',
                        './data/name_prefix.tbl',
                        './data/name_misc.tbl',
                        './data/saints.tbl',
                        './data/surname.tbl',
                        './data/title.tbl'])

name_correction_list = CorrectionList(name = 'Name correction list')
name_correction_list.load('./data/name_corr.lst')

address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                   default = '')
address_lookup_table.load(['./data/country.tbl',
                           './data/address_misc.tbl',
                           './data/address_qual.tbl',
                           './data/institution_type.tbl',
                           './data/post_address.tbl',
                           './data/saints.tbl',
                           './data/territory.tbl',
                           './data/unit_type.tbl',
                           './data/wayfare_type.tbl'])

addr_correction_list = CorrectionList(name = 'Address corr. list')
addr_correction_list.load('./data/address_corr.lst')

pc_geocode_table = GeocodeLookupTable(name = 'Postcode centroids',
                                   default = [])
pc_geocode_table.load('./data/postcode_centroids.csv')
In the following code block two hidden Markov models (HMMs) are defined (first their states and then the observations, i.e. tags) and loaded from files. These HMMs are used by the name and address component standardisation processes. Further information on HMMs is given in Chapters 7 and 8.
# ====================================================================
# Define and load hidden Markov models (HMMs)

name_states = ['titl','baby','knwn','andor','gname1','gname2','ghyph',
               'gopbr','gclbr','agname1','agname2','coma','sname1',
               'sname2','shyph','sopbr','sclbr','asname1','asname2',
               'pref1','pref2','rubb']
name_tags = ['NU','AN','TI','PR','GF','GM','SN','ST','SP','HY','CO',
             'NE','II','BO','VB','UN','RU']

myname_hmm = hmm('Name HMM', name_states, name_tags)
myname_hmm.load_hmm('./hmm/name-absdiscount.hmm')

address_states = ['wfnu','wfna1','wfna2','wfql','wfty','unnu','unty',
                  'prna1','prna2','inna1','inna2','inty','panu',
                  'paty','hyph','sla','coma','opbr','clbr','loc1',
                  'loc2','locql','pc','ter1','ter2','cntr1','cntr2',
                  'rubb']
address_tags = ['PC','N4','NU','AN','TR','CR','LN','ST','IN','IT',
                'LQ','WT','WN','UT','HY','SL','CO','VB','PA','UN',
                'RU']

myaddress_hmm = hmm('Address HMM', address_states, address_tags)
myaddress_hmm.load_hmm('./hmm/address-absdiscount.hmm')
Next follows a list of possible date formats needed to parse date strings in the data standardisation process. See Section 6.6 for more details on these format strings.
# ====================================================================
# Define a list of date parsing format strings

date_parse_formats = ['%d %m %Y',   # 24 04 2002 or 24 4 2002
                      '%d %B %Y',   # 24 Apr 2002 or 24 April 2002
                      '%m %d %Y',   # 04 24 2002 or 4 24 2002
                      '%B %d %Y',   # Apr 24 2002 or April 24 2002
                      '%Y %m %d',   # 2002 04 24 or 2002 4 24
                      '%Y %B %d',   # 2002 Apr 24 or 2002 April 24
                      '%Y%m%d',     # 20020424   ** ISO standard **
                      '%d%m%Y',     # 24042002
                      '%m%d%Y',     # 04242002
                      '%d %m %y',   # 24 04 02 or 24 4 02
                      '%d %B %y',   # 24 Apr 02 or 24 April 02
                      '%y %m %d',   # 02 04 24 or 02 4 24
                      '%y %B %d',   # 02 Apr 24 or 02 April 24
                      '%m %d %y',   # 04 24 02 or 4 24 02
                      '%B %d %y',   # Apr 24 02 or April 24 02
                      '%y%m%d',     # 020424
                      '%d%m%y',     # 240402
                      '%m%d%y',     # 042402
                     ]
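As an aside, the effect of a single format string can be explored with
the strptime function from Python's standard time module. This is only
an illustration of the format string syntax and not the way Febrl
parses dates internally:

import time

# Illustration only: parse one date string with one format string
parsed = time.strptime('24 April 2002', '%d %B %Y')
print('%d-%02d-%02d' % (parsed[0], parsed[1], parsed[2]))   # 2002-04-24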
Now the different standardisation processes can be defined. Each needs input fields and output fields. It is assumed that these input fields are available in the input data set given to the record standardiser further below. Similarly, the output fields need to be defined in the output data set of the record standardiser. In this example the output data set of the record standardiser is the temporary data set defined earlier. This temporary data set will later be used as input for the deduplication process.
Each of the component standardisers (for dates, names or addresses) has special arguments which are explained in detail in Chapter 6. It is possible to define more than one standardiser, for example if several dates need to be cleaned and standardised, or if more than one name component or several addresses are available in a record.
# ====================================================================
# Define standardisers for dates

dob_std = DateStandardiser(name = 'DOB-std',
                    description = 'Date of birth standardiser',
                   input_fields = 'date_of_birth',
                  output_fields = ['dob_day','dob_month','dob_year'],
                  parse_formats = date_parse_formats)
Name cleaning and standardisation can be done using a rule-based approach or a hidden Markov model (HMM) approach. The following two code blocks show definitions of both a rule-based and an HMM based name standardiser. For more details on HMM standardisation see Chapter 7; rule-based name standardisation is explained in Section 6.3 and the HMM based approach in Section 6.4.
# ====================================================================
# Define a standardiser for names based on rules

name_rules_std = NameRulesStandardiser(name = 'Name-Rules',
                               input_fields = ['given_name',
                                               'surname'],
                              output_fields = ['title',
                                               'gender_guess',
                                               'given_name',
                                               'alt_given_name',
                                               'surname',
                                               'alt_surname'],
                             name_corr_list = name_correction_list,
                             name_tag_table = name_lookup_table,
                                male_titles = ['mr'],
                              female_titles = ['ms'],
                            field_separator = ' ',
                           check_word_spill = True)
# ====================================================================
# Define a standardiser for names based on HMM

name_hmm_std = NameHMMStandardiser(name = 'Name-HMM',
                           input_fields = ['given_name','surname'],
                          output_fields = ['title',
                                           'gender_guess',
                                           'given_name',
                                           'alt_given_name',
                                           'surname',
                                           'alt_surname'],
                         name_corr_list = name_correction_list,
                         name_tag_table = name_lookup_table,
                            male_titles = ['mr'],
                          female_titles = ['ms'],
                               name_hmm = myname_hmm,
                        field_separator = ' ',
                       check_word_spill = True)
For addresses, currently only a HMM based standardisation approach is available. See Section 6.5 for more details and a description of all the possible arguments for this standardiser.
# ====================================================================
# Define a standardiser for addresses based on HMM

address_hmm_std = AddressHMMStandardiser(name = 'Address-HMM',
                                 input_fields = ['street_num',
                                                 'address_part_1',
                                                 'address_part_2',
                                                 'suburb',
                                                 'postcode',
                                                 'state'],
                                output_fields = ['wayfare_number',
                                                 'wayfare_name',
                                                 'wayfare_qualifier',
                                                 'wayfare_type',
                                                 'unit_number',
                                                 'unit_type',
                                                 'property_name',
                                                 'institution_name',
                                                 'institution_type',
                                                 'postaddress_number',
                                                 'postaddress_type',
                                                 'locality_name',
                                                 'locality_qualifier',
                                                 'postcode',
                                                 'territory',
                                                 'country',
                                                 'address_hmm_prob'],
                            address_corr_list = addr_correction_list,
                            address_tag_table = address_lookup_table,
                              field_separator = ' ',
                                  address_hmm = myaddress_hmm)
In many data sets there are input fields that are already in a cleaned
and standardised form, and which therefore can directly be copied into
corresponding output fields (so they are available for a linkage or
deduplication process later). The simple PassFieldStandardiser
as described in Section 6.7 can be used for
this. It copies the values from the input fields into the
corresponding output fields without any modifications. In the example
given below, values from the input field 'rec_id' are copied
into the output field of the same name, and values from the input
field 'soc_sec_id' are copied into an output field of the
same name.
# ====================================================================

pass_fields = PassFieldStandardiser(name = 'Pass fields',
                            input_fields = ['rec_id', 'soc_sec_id'],
                           output_fields = ['rec_id', 'soc_sec_id'])
Now that the component standardisers are defined, they can be passed on to a record standardiser, which defines the input and output data sets for the data cleaning and standardisation process. A check is made to determine if the input and output fields in all the component standardisers defined above are available in the input or output data sets defined in the record standardiser.
As can be seen in the comp_stand list, not all defined
component standardisers need to be passed to the record standardiser.
Only the ones listed there are actually used in the standardisation
process. It is very easy to change the standardisation process by
simply changing the component standardisers passed to the record
standardiser.
# ====================================================================
# Define a record standardiser

comp_stand = [dob_std, name_rules_std, address_hmm_std, pass_fields]

# The HMM based name standardisation is not used in this example
# standardiser, uncomment the lines below (and comment the ones above)
# to use HMM standardisation for names.
#
# comp_stand = [dob_std, name_hmm_std, address_hmm_std, pass_fields]

example_standardiser = RecordStandardiser(name = 'Example-std',
                                   description = 'Exam. standardiser',
                                 input_dataset = indata,
                                output_dataset = tmpdata,
                                      comp_std = comp_stand)
In the next few code blocks definitions for the record linkage (or deduplication) process are given. First, a blocking index is defined with three different indexes on various field combinations and encoding methods. Febrl contains various indexing methods, including standard blocking, the sorted neighbourhood approach and an experimental bigram index. In the code shown below indexes for all three methods are defined, but only one will later be used in the definition of the deduplication process (in this example the blocking index). Note that for a linkage process the index method used must be the same for both data sets (i.e. it is not possible to use a blocking index for one data set and a sorted neighbourhood index for the other). See Section 9.1 for more details on the various indexing methods and their functionalities.
# ====================================================================
# Define blocking index(es) (one per temporary data set)

myblock_def = [[('surname','dmetaphone', 4), ('dob_year','direct')],
               [('given_name','truncate', 3), ('postcode','direct')],
               [('dob_month','direct'), ('locality_name','nysiis')],
              ]

# Define one or more indexes (to be used in the classifier further
# below)

example_block_index = BlockingIndex(name = 'Index-blocking',
                                 dataset = tmpdata,
                               index_def = myblock_def)

example_sorting_index = SortingIndex(name = 'Index-sorting',
                                  dataset = tmpdata,
                                index_def = myblock_def,
                              window_size = 3)

example_bigram_index = BigramIndex(name = 'Index-bigram',
                                dataset = tmpdata,
                              index_def = myblock_def,
                              threshold = 0.75)
Similar to the definition of component standardisers earlier, we now have to define field comparison functions which will be used to compare the selected fields in the cleaned data set in order to calculate the matching weight vector. Section 9.2 describes all the field comparison functions available. The following code block shows different examples of field comparator functions.
# ====================================================================
# Define comparison functions for linkage

given_name_nysiis = \
    FieldComparatorEncodeString(name = 'Given name NYSIIS',
                            fields_a = 'given_name',
                            fields_b = 'given_name',
                              m_prob = 0.95,
                              u_prob = 0.001,
                      missing_weight = 0.0,
                       encode_method = 'nysiis',
                             reverse = False)

surname_dmetaphone = \
    FieldComparatorEncodeString(name = 'Surname D-Metaphone',
                            fields_a = 'surname',
                            fields_b = 'surname',
                              m_prob = 0.95,
                              u_prob = 0.001,
                      missing_weight = 0.0,
                       encode_method = 'dmetaphone',
                             reverse = False)

locality_name_key = \
    FieldComparatorKeyDiff(name = 'Locality name key diff',
                       fields_a = 'locality_name',
                       fields_b = 'locality_name',
                         m_prob = 0.95,
                         u_prob = 0.001,
                 missing_weight = 0.0,
                   max_key_diff = 2)

wayfare_name_winkler = \
    FieldComparatorApproxString(name = 'Wayfare name Winkler',
                            fields_a = 'wayfare_name',
                            fields_b = 'wayfare_name',
                              m_prob = 0.95,
                              u_prob = 0.001,
                      missing_weight = 0.0,
                      compare_method = 'winkler',
                    min_approx_value = 0.7)

postcode_distance = \
    FieldComparatorDistance(name = 'Postcode distance',
                        fields_a = 'postcode',
                        fields_b = 'postcode',
                          m_prob = 0.95,
                          u_prob = 0.001,
                  missing_weight = 0.0,
                   geocode_table = pc_geocode_table,
                    max_distance = 50.0)
age = FieldComparatorAge(name = 'Age',
                     fields_a = ['dob_day', 'dob_month', 'dob_year'],
                     fields_b = ['dob_day', 'dob_month', 'dob_year'],
            m_probability_day = 0.95,
            u_probability_day = 0.03333,
          m_probability_month = 0.95,
          u_probability_month = 0.083,
           m_probability_year = 0.95,
           u_probability_year = 0.01,
                max_perc_diff = 10.0,
                     fix_date = 'today')
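As background, in the classical Fellegi and Sunter framework an m and a
u probability translate into an agreement weight of roughly log2(m/u)
and a disagreement weight of roughly log2((1-m)/(1-u)). The following
calculation is an illustration of this general formula only (it is not
code taken from the Febrl modules):

import math

# General Fellegi and Sunter weight calculation (illustration only)
m_prob = 0.95
u_prob = 0.001

agreement_weight    = math.log(m_prob / u_prob, 2)
disagreement_weight = math.log((1.0 - m_prob) / (1.0 - u_prob), 2)

print('Agreement weight:    %.2f' % agreement_weight)      # about  9.89
print('Disagreement weight: %.2f' % disagreement_weight)   # about -4.32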
The defined field comparators can now be passed to a record comparator as shown in the following code block. Besides a list of field comparison functions, a record comparator must be given references to the two data sets to be compared. A check will be performed to ensure that the field names given to the field comparison functions (as in the above code blocks) are available in the given data sets.
# ====================================================================
# Define a record comparator using field comparison functions

field_comparisons = [given_name_nysiis,
                     surname_dmetaphone,
                     locality_name_key,
                     wayfare_name_winkler,
                     postcode_distance,
                     age]

example_comparator = RecordComparator(tmpdata, tmpdata,
                                      field_comparisons)
The last component that needs to be defined before a linkage or deduplication can be started is a classifier, which classifies the vectors of weights calculated by the field comparison functions above into links, non-links or possible links. The available classifiers are described in Section 9.5. The arguments given to the classifier are the two data sets and, for our example (a classical Fellegi and Sunter classifier), the values for the lower and upper thresholds.
# ====================================================================
# Define a classifier for classifying the matching vectors

example_fs_classifier = \
    FellegiSunterClassifier(name = 'Fellegi and Sunter',
                       dataset_a = tmpdata,
                       dataset_b = tmpdata,
                 lower_threshold = 0.0,
                 upper_threshold = 30.0)
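To illustrate what such a classifier does, the following is a minimal
sketch (not Febrl's actual implementation) of the classical Fellegi and
Sunter decision rule: the weights in a comparison vector are summed and
the total is compared against the two thresholds given above.

# Minimal sketch of the Fellegi and Sunter decision rule - for
# illustration only, this is not the code used inside Febrl
def fs_classify(weight_vector, lower_threshold = 0.0,
                upper_threshold = 30.0):
  total = sum(weight_vector)
  if total > upper_threshold:
    return 'link'
  elif total < lower_threshold:
    return 'non-link'
  else:
    return 'possible link'

print(fs_classify([4.2, 5.1, 3.8, 6.0, 2.5, 4.9]))   # 'possible link'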
Now that all necessary components have been defined and initialised, a deduplication (or similarly a linkage) process can be started easily by invoking the corresponding method of the project object. The various defined components need to be given as arguments, as well as the range of records (first record and number of records) to be deduplicated, and the desired form of outputs. See Section 9.6 for a detailed description of how to start a deduplication or linkage process, and Chapter 10 for descriptions of all possible output forms.
# ====================================================================
# Start a deduplication task

myproject.deduplicate(input_dataset = indata,
                        tmp_dataset = tmpdata,
                   rec_standardiser = example_standardiser,
                     rec_comparator = example_comparator,
                     blocking_index = example_block_index,
                         classifier = example_fs_classifier,
                       first_record = 0,
                     number_records = 5000,
                   output_histogram = 'dedup-example-histogram.res',
            output_rec_pair_details = 'dedup-example-details.res',
            output_rec_pair_weights = 'dedup-example-weights.res',
                   output_threshold = 10.0,
                  output_assignment = 'one2one')

# ====================================================================

myfebrl.finalise()

# ====================================================================
A Febrl project module is properly ended by a call to the
finalise() method of your Febrl object. This
will stop the parallel environment (if Febrl has been started
in parallel) and make sure everything is properly shut down.
The argument rec_standardiser can be
set to None, in which case no cleaning and standardisation is
performed. Instead, the records from the input data set are directly
copied into the temporary data set. Therefore, the input data set
and the temporary data set must have the same field name
definitions.
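As a sketch, such a call could look as follows; it assumes that the
input and the temporary data set have been defined with identical field
name definitions (which is not the case for the example data sets
above):

# Sketch only: deduplicate without cleaning and standardisation.
# This assumes 'indata' and 'tmpdata' share the same field names.
myproject.deduplicate(input_dataset = indata,
                        tmp_dataset = tmpdata,
                   rec_standardiser = None,
                     rec_comparator = example_comparator,
                     blocking_index = example_block_index,
                         classifier = example_fs_classifier,
                       first_record = 0,
                     number_records = 5000,
                  output_assignment = 'one2one')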
Once you have modified a project module according to your needs (let's
say you edited a file called myproject.py), you can run it by
simply typing

python myproject.py