9.2 Field Comparison Functions

The heart of the record linkage process is the comparison of fields from individual records. The field comparison functions return the basic matching weights, which are stored in a weight vector for each record pair that is compared. The weight vectors are then passed to a classifier, such as the classical Fellegi and Sunter [13] approach, which calculates a matching decision (match, non-match or possible match).

The Febrl system contains a number of different FieldComparator functions, implemented in the module comparison.py. The field comparators allow various comparisons of strings, numbers, dates, ages and times.

The following arguments need to be given to all field comparison functions when they are initialised:

dataset_a
A reference to a data set. Records taken from this data set will be compared to records from dataset_b.
dataset_b
A reference to a data set. This can be the same data set as dataset_a, for example if a deduplication is performed. Records taken from this data set will be compared to records from dataset_a.
fields_a
A single field name (a string) or a list of field names from dataset_a. If a list of field names is given, the comparison function will concatenate them into one single string (without whitespace between the fields), and this string will then be compared with a string formed similarly using the fields from fields_b.
fields_b
A single field name (a string) or a list of field names from dataset_b. If a list of field names is given, the comparison function will concatenate them into one single string (without whitespace between the fields), and this string will then be compared with a string formed similarly using the fields from fields_a.
missing_weight
A numerical value (floating-point number) that will be returned if one or more of the fields in fields_a or fields_b correspond to a missing value (as defined in a data set, see Chapter 13). The default value (i.e. if no argument missing_weight is given) for the missing value weight is zero.
M-probability
The probability that the two fields (or field lists) fields_a and fields_b are the same in matched record pairs. For most field comparison functions this is a numerical value given as argument m_probability. For the date and age comparators separate probabilities need to be given for day, month and year in the three arguments m_probability_day, m_probability_month and m_probability_year.
U-probability
The probability that two fields (or field lists) fields_a and fields_b are the same in un-matched record pairs. The argument names are similar to the ones for the M-probability.

The values for all M- and U-probabilities must be between 0.0 and 1.0. It is possible to re-set the probabilities for a field comparator at any time using the method set_probabilities().
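
As a rough illustration of how the fields_a, fields_b and missing_weight arguments interact, the following minimal Python sketch concatenates the listed fields without separating whitespace and returns the missing weight as soon as any field is missing. It is not Febrl's implementation: the function names, the dictionary-based record layout and the use of an empty string as the missing value are assumptions made for this example only.

```python
def build_comparison_value(record, fields, missing_values=("",)):
    """Concatenate one or more fields of a record into a single string.

    `record` is assumed to be a dict mapping field names to strings
    (a hypothetical layout, not Febrl's data set interface).
    Returns None if any of the fields holds a missing value.
    """
    if isinstance(fields, str):
        fields = [fields]  # a single field name was given
    values = [record.get(field, "") for field in fields]
    if any(value in missing_values for value in values):
        return None
    return "".join(values)  # no whitespace between the fields


def exact_compare(rec_a, rec_b, fields_a, fields_b,
                  agree_weight, disagree_weight, missing_weight=0.0):
    """Hypothetical exact comparison of two (possibly concatenated) values."""
    value_a = build_comparison_value(rec_a, fields_a)
    value_b = build_comparison_value(rec_b, fields_b)
    if value_a is None or value_b is None:
        return missing_weight  # one or more fields are missing
    return agree_weight if value_a == value_b else disagree_weight


rec_a = {"given_name": "anna", "surname": "lee"}
rec_b = {"name": "annalee"}
weight = exact_compare(rec_a, rec_b, ["given_name", "surname"], "name",
                       agree_weight=4.0, disagree_weight=-3.0)
```

Here the two fields of rec_a are joined to "annalee" before being compared with the single field of rec_b, so the agreement weight is returned.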

The agreement and disagreement weights are computed using the M- and U-probabilities, as described in the record linkage literature (see for example [13,15,23,37]):

$$ \mathit{agree\_weight} = \log_2 \left( \frac{\mathit{m\_probability}}{\mathit{u\_probability}} \right) $$

$$ \mathit{disagree\_weight} = \log_2 \left( \frac{1.0 - \mathit{m\_probability}}{1.0 - \mathit{u\_probability}} \right) $$
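
The two formulas can be checked numerically with a short Python sketch. The function name compute_weights is hypothetical and not part of Febrl. The sketch also rejects probabilities of exactly 0.0 or 1.0, because the logarithms would then be undefined; how Febrl itself treats these boundary values is not covered here.

```python
import math


def compute_weights(m_probability, u_probability):
    """Return (agree_weight, disagree_weight) from the M- and U-probabilities.

    Hypothetical helper for illustration only. Boundary values 0.0 and 1.0
    are rejected here because they lead to log(0) or a division by zero.
    """
    for name, value in (("m_probability", m_probability),
                        ("u_probability", u_probability)):
        if not 0.0 < value < 1.0:
            raise ValueError("%s must lie strictly between 0.0 and 1.0" % name)
    agree_weight = math.log2(m_probability / u_probability)
    disagree_weight = math.log2((1.0 - m_probability) /
                                (1.0 - u_probability))
    return agree_weight, disagree_weight


agree, disagree = compute_weights(0.95, 0.05)
```

With m_probability = 0.95 and u_probability = 0.05 the agreement weight is roughly 4.25 and the disagreement weight roughly -4.25. In general, a field that agrees much more often in matches than in non-matches yields a large positive agreement weight and a large negative disagreement weight.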

Frequency look-up tables can be used with several of the field comparison functions, using the following optional arguments.

frequency_table
A reference to a frequency table (as defined and loaded using methods from the lookup.py module).
freq_table_max_weight
A numerical maximum weight value, which can be used to impose an upper limit on weights computed using frequency tables. For rare entries in large tables, weights can become very large and thus completely dominate other weights, which should be prevented.
freq_table_min_weight
A numerical minimum weight value, which can be used to impose a lower limit on weights computed using frequency tables.
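
One way to picture the effect of the two limits is a simple clamp, sketched below in Python. This is an assumption about how such limits might be applied, not Febrl's actual code: the function clamp_weight is hypothetical, and the frequency-based weight passed to it is taken as already computed.

```python
def clamp_weight(weight, freq_table_max_weight=None,
                 freq_table_min_weight=None):
    """Restrict a frequency-based weight to the optional limits.

    Hypothetical helper: a limit given as None means "no limit".
    """
    if (freq_table_max_weight is not None
            and freq_table_min_weight is not None
            and freq_table_min_weight > freq_table_max_weight):
        raise ValueError("minimum weight must not exceed maximum weight")
    if freq_table_max_weight is not None:
        weight = min(weight, freq_table_max_weight)  # cap very large weights
    if freq_table_min_weight is not None:
        weight = max(weight, freq_table_min_weight)  # raise very small weights
    return weight


# A very rare value might produce a weight of 25.0; it is capped at 12.0.
capped = clamp_weight(25.0, freq_table_max_weight=12.0,
                      freq_table_min_weight=-8.0)
```

A weight that already lies between the two limits is returned unchanged.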

The calculation of frequency-dependent weights is described in detail in Section 9.2.1 below.

Two additional arguments to all field comparison functions are name and description, which can be used to document the functionality of a field comparator.