9.2 Field Comparison Functions

The heart of the record linkage process consists of the comparison of
fields from individual records. These field comparison functions
return the basic *matching weights* that
are stored in a *weight vector* for each
record pair that is compared. The weight vectors are then given to a
classifier, such as the classical *Fellegi and Sunter* [13] approach,
to calculate a matching decision (*match*, *non-match* or
*possible match*).

The **Febrl** system contains a number of different
`FieldComparator` functions, implemented in the module
`comparison.py`. The field comparators
allow various comparisons of strings, numbers, dates, ages and times.

The following arguments need to be given to all field comparison functions when they are initialised:

`dataset_a`

A reference to a data set. Records taken from this data set will be compared to records from `dataset_b`.

`dataset_b`

A reference to a data set. This can be the same data set as `dataset_a`, for example if a deduplication is performed. Records taken from this data set will be compared to records from `dataset_a`.

`fields_a`

A single field name (a string) or a list of field names from `dataset_a`. If a list of field names is given, the comparison function will concatenate them into one single string (without whitespace between the fields), and this string will then be compared with a string formed similarly using the fields from `fields_b`.

`fields_b`

A single field name (a string) or a list of field names from `dataset_b`. If a list of field names is given, the comparison function will concatenate them into one single string (without whitespace between the fields), and this string will then be compared with a string formed similarly using the fields from `fields_a`.

`missing_weight`

A numerical value (floating-point number) that will be returned if one or more of the fields in `fields_a` or `fields_b` correspond to a missing value (as defined in a data set, see Chapter 13). The default value (i.e. if no argument `missing_weight` is given) for the missing value weight is zero.

M-probability

The probability that the two fields (or field lists) `fields_a` and `fields_b` are the same in matched record pairs. For most field comparison functions this is a numerical value given as argument `m_probability`. For the date and age comparators separate probabilities need to be given for day, month and year in the three arguments `m_probability_day`, `m_probability_month` and `m_probability_year`.

U-probability

The probability that two fields (or field lists) `fields_a` and `fields_b` are the same in un-matched record pairs. The argument names are similar to the ones for the M-probability.
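
The handling of field lists and missing values described for `fields_a`, `fields_b` and `missing_weight` can be illustrated with a minimal Python sketch. This is not the Febrl implementation: the helper names `build_compare_string` and `compare_exact`, the use of plain dictionaries as records, and the treatment of the empty string as the only missing value are all assumptions made for illustration.

```python
def build_compare_string(record, fields, missing_values=("",)):
    """Concatenate the named fields of a record without separators.

    Hypothetical helper: 'record' is a plain dict, and a field counts as
    missing if it is absent or its value is listed in 'missing_values'.
    Returns None if any of the fields is missing.
    """
    if isinstance(fields, str):
        fields = [fields]  # a single field name is treated as a one-item list
    values = [record.get(field, "") for field in fields]
    if any(value in missing_values for value in values):
        return None
    return "".join(values)  # no whitespace is inserted between the fields


def compare_exact(rec_a, rec_b, fields_a, fields_b,
                  agree_weight, disagree_weight, missing_weight=0.0):
    """Exact comparison of two concatenated field strings (illustrative only)."""
    str_a = build_compare_string(rec_a, fields_a)
    str_b = build_compare_string(rec_b, fields_b)
    if str_a is None or str_b is None:
        return missing_weight  # one or more fields held a missing value
    return agree_weight if str_a == str_b else disagree_weight


rec_a = {"given_name": "mary", "surname": "smith"}
rec_b = {"name": "marysmith"}
weight = compare_exact(rec_a, rec_b, ["given_name", "surname"], "name",
                       agree_weight=4.0, disagree_weight=-3.0)
```

In this sketch the two fields of `rec_a` are joined into `marysmith` before the comparison, so the record pair receives the agreement weight; if either side contained an empty field, `missing_weight` would be returned instead.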

The values for all M- and U-probabilities must be between 0.0 and 1.0.
It is possible to re-set the probabilities for a field comparator at
any time using the method `set_probabilities()`.

The *agreement* and *disagreement* weights are computed using
the M- and U-probabilities, as described in the record linkage
literature (see for example [13,15,23,37]).
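
As a rough illustration, the sketch below uses the formulation commonly given in the record linkage literature, where the agreement weight is the logarithm of the ratio m/u and the disagreement weight is the logarithm of (1-m)/(1-u). The function name `agreement_weights` and the choice of base 2 for the logarithm are assumptions for this example, not a statement of what `comparison.py` does internally. The sketch also requires both probabilities to lie strictly between 0.0 and 1.0, because the logarithms are undefined at the end points.

```python
import math


def agreement_weights(m_probability, u_probability):
    """Return (agreement_weight, disagreement_weight) for one field.

    Illustrative sketch only: uses log base 2 of the probability ratios,
    as commonly described for the Fellegi and Sunter approach.
    """
    for prob in (m_probability, u_probability):
        if not 0.0 < prob < 1.0:
            raise ValueError("probabilities must be strictly between 0.0 and 1.0")
    # Agreement: how much more likely the fields agree in matches than in non-matches
    agree = math.log2(m_probability / u_probability)
    # Disagreement: the same ratio for the complementary (disagreeing) outcome
    disagree = math.log2((1.0 - m_probability) / (1.0 - u_probability))
    return agree, disagree


agree, disagree = agreement_weights(m_probability=0.95, u_probability=0.01)
```

Whenever the M-probability is larger than the U-probability, the agreement weight is positive and the disagreement weight is negative, so agreeing fields push a record pair towards a match and disagreeing fields push it away.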

Frequency look-up tables can be used with several of the field comparison functions, using the following optional arguments.

`frequency_table`

A reference to a frequency table (as defined and loaded using methods from the `lookup.py` module).

`freq_table_max_weight`

A numerical maximum weight value, which can be used to restrict weights computed using frequency tables to an upper limit (for rare entries in large tables, weights can become very large and completely dominate the other weights, which should be prevented).

`freq_table_min_weight`

A numerical minimum weight value, which can be used to restrict weights computed using frequency tables to a lower limit.

The calculation of frequency dependent weights is described in detail in Section 9.2.1 below.
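
One plausible reading of the two limit arguments is a simple clamp applied after a frequency-based weight has been calculated. The sketch below shows only that clamping step; the function name `restrict_weight` is hypothetical, and how the frequency-dependent weight itself is obtained is left to Section 9.2.1.

```python
def restrict_weight(weight, freq_table_max_weight=None, freq_table_min_weight=None):
    """Clamp a frequency-based weight to the optional limits.

    Hypothetical helper for illustration; a limit of None means
    that no restriction is applied on that side.
    """
    if (freq_table_max_weight is not None and freq_table_min_weight is not None
            and freq_table_min_weight > freq_table_max_weight):
        raise ValueError("minimum weight must not exceed maximum weight")
    if freq_table_max_weight is not None:
        weight = min(weight, freq_table_max_weight)  # cap very large weights
    if freq_table_min_weight is not None:
        weight = max(weight, freq_table_min_weight)  # raise very small weights
    return weight


# A very rare value might yield a large weight that would dominate the vector
capped = restrict_weight(25.0, freq_table_max_weight=12.0)
```

With a maximum of 12.0, the weight of 25.0 is reduced to the limit, while weights already inside the allowed range pass through unchanged.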

Two additional arguments to all field comparison functions are
`name` and `description`, which can be used to document the
functionality of a field comparator.