The heart of the record linkage process is the comparison of fields from individual records. The field comparison functions return the basic matching weights, which are stored in a weight vector for each compared record pair. The weight vectors are then passed to a classifier, such as the classical Fellegi and Sunter [13] approach, which calculates a matching decision (match, non-match or possible match).
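As a rough illustration of how a weight vector leads to a matching decision, the sketch below sums the weights and compares the total against two thresholds, in the spirit of the Fellegi and Sunter approach. The function name `classify` and the threshold values are assumptions made for this example and are not part of Febrl.

```python
def classify(weight_vector, lower_threshold, upper_threshold):
    """Return a matching decision for one record pair.

    Hypothetical helper, not Febrl's classifier: the summed weight is
    compared against two user-chosen thresholds.
    """
    if lower_threshold > upper_threshold:
        raise ValueError("lower_threshold must not exceed upper_threshold")
    total = sum(weight_vector)
    if total > upper_threshold:
        return "match"
    if total < lower_threshold:
        return "non-match"
    return "possible match"


# Example weight vector: one weight per compared field (made-up values).
decision = classify([4.2, 3.1, -1.5], lower_threshold=0.0, upper_threshold=5.0)
print(decision)
```

Record pairs whose summed weight falls between the two thresholds are left as possible matches, which would typically be reviewed manually.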
The Febrl system contains a number of different FieldComparator functions, implemented in the module comparison.py. The field comparators allow various comparisons of strings, numbers, dates, ages and times.
The following arguments need to be given to all field comparison functions when they are initialised:
dataset_a: The first data set. Records taken from this data set will be compared to records from dataset_b.
dataset_b: The second data set. This can be the same as dataset_a, for example if a deduplication is performed. Records taken from this data set will be compared to records from dataset_a.
fields_a: A field name or a list of field names from dataset_a. If a list of field names is given, the comparison function will concatenate the corresponding field values into one single string (without whitespace between the fields), and this string will then be compared with a string formed similarly using the fields from fields_b.
fields_b: A field name or a list of field names from dataset_b. If a list of field names is given, the comparison function will concatenate the corresponding field values into one single string (without whitespace between the fields), and this string will then be compared with a string formed similarly using the fields from fields_a.
missing_weight: The weight assigned to a comparison if the values in fields_a or fields_b correspond to a missing value (as defined in a data set, see Chapter 13). The default value for the missing value weight (i.e. if no argument missing_weight is given) is zero.
M-probability: The probability that the values in fields_a and fields_b are the same in matched record pairs. For most field comparison functions this is a numerical value given as argument m_probability. For the date and age comparators separate probabilities need to be given for day, month and year in the three arguments m_probability_day, m_probability_month and m_probability_year.
U-probability: The probability that the values in fields_a and fields_b are the same in unmatched record pairs. The argument names are similar to the ones for the M-probability.
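The handling of fields_a, fields_b and missing_weight described in the list above can be sketched as follows. This is a minimal illustration, not Febrl's implementation: it assumes records are plain dictionaries of strings, treats the empty string as the only missing value, and uses a simple exact comparison. The names `field_value` and `compare_exact` are hypothetical.

```python
def field_value(record, fields):
    """Join one or more field values into a single string, without
    whitespace between them. `record` is assumed to be a dict of strings."""
    if isinstance(fields, str):
        fields = [fields]
    return "".join(record[name] for name in fields)


def compare_exact(rec_a, rec_b, fields_a, fields_b,
                  agree_weight, disagree_weight,
                  missing_weight=0.0, missing_values=("",)):
    """Hypothetical exact comparator returning a single weight."""
    value_a = field_value(rec_a, fields_a)
    value_b = field_value(rec_b, fields_b)
    # A missing value on either side yields the missing weight (default zero).
    if value_a in missing_values or value_b in missing_values:
        return missing_weight
    return agree_weight if value_a == value_b else disagree_weight


rec_a = {"given_name": "peter", "surname": "miller"}
rec_b = {"full_name": "petermiller"}

# Two fields from rec_a are concatenated and compared with one field from rec_b.
weight = compare_exact(rec_a, rec_b, ["given_name", "surname"], "full_name",
                       agree_weight=5.0, disagree_weight=-3.0)
print(weight)
```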
The values for all M- and U-probabilities must be between 0.0 and 1.0. The probabilities for a field comparator can be reset at any time using the method set_probabilities().
The agreement and disagreement weights are computed using the M- and U-probabilities, as described in the record linkage literature (see for example [13,15,23,37]).
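For illustration, the sketch below uses the base-2 logarithm formulation that is common in the record linkage literature: the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)). The function name `agreement_weights` is hypothetical, and Febrl's own calculation may differ in details. As an additional assumption, the sketch rejects probabilities of exactly 0.0 or 1.0 to avoid division by zero and the logarithm of zero.

```python
import math


def agreement_weights(m_probability, u_probability):
    """Return (agreement_weight, disagreement_weight) for one field.

    Hypothetical helper assuming the common log2 formulation.
    """
    for name, p in (("m_probability", m_probability),
                    ("u_probability", u_probability)):
        # Stricter than "between 0.0 and 1.0": the end points are excluded
        # here so that both logarithms are defined.
        if not 0.0 < p < 1.0:
            raise ValueError("%s must be strictly between 0.0 and 1.0" % name)
    agree = math.log2(m_probability / u_probability)
    disagree = math.log2((1.0 - m_probability) / (1.0 - u_probability))
    return agree, disagree


# Example with made-up probabilities: values agree in 95% of matched pairs
# and in 5% of unmatched pairs.
agree, disagree = agreement_weights(0.95, 0.05)
print(agree, disagree)  # approximately 4.25 and -4.25
```

When the M-probability is larger than the U-probability, agreement yields a positive weight and disagreement a negative one, so agreeing fields push a record pair towards a match.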
Frequency look-up tables can be used with several of the field comparison functions, using the following optional arguments: frequency_table, freq_table_max_weight and freq_table_min_weight.
The calculation of frequency dependent weights is described in detail in Section 9.2.1 below.
Two additional arguments to all field comparison functions are name and description, which can be used to document the functionality of a field comparator.