9.2.1 Frequency Dependent Weight Calculation

If a frequency table is supplied for a field comparator that supports frequency dependent weight calculation, both the agreement and the disagreement weight are calculated from the frequencies of the input field values being compared, provided these values are found in the frequency table.

If a field value is listed in the frequency table, its count and the sum of all counts in the table are used to compute the frequency probability of this value, from which its agreement and disagreement weights are derived.

$\begin{eqnarray*} value\_freq\_prob = \frac{value\_count}{\sum value\_count} \end{eqnarray*}$

$\begin{eqnarray*} agree\_weight = \log_2 \left( \frac{1.0}{value\_freq\_prob} \right) \end{eqnarray*}$

$\begin{eqnarray*} disagree\_weight = \log_2 \left( \frac{1.0 - value\_freq\_prob}{1.0 - {value\_freq\_prob}^2} \right) \end{eqnarray*}$
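
The three formulas above can be sketched in a few lines of Python 3. This is only an illustration and not Febrl's own code: the function name `frequency_weights`, the use of a plain dictionary as frequency table, and the example counts are all assumptions made for this sketch. Values that are missing from the table, and the degenerate case where a single value accounts for every count (the disagreement formula is then undefined), are reported as `None`.

```python
import math


def frequency_weights(value, freq_table):
    """Return (agree_weight, disagree_weight) for a value, or None.

    freq_table is a hypothetical dict mapping field values to counts.
    """
    count = freq_table.get(value, 0)
    total = sum(freq_table.values())

    # Not listed in the table, or the value makes up the whole table
    # (then 1 - p**2 is zero and the disagreement weight is undefined).
    if count <= 0 or count >= total:
        return None

    value_freq_prob = count / total
    agree_weight = math.log2(1.0 / value_freq_prob)
    disagree_weight = math.log2(
        (1.0 - value_freq_prob) / (1.0 - value_freq_prob ** 2)
    )
    return agree_weight, disagree_weight


# Made-up surname counts, for illustration only.
surnames = {"smith": 50, "jones": 30, "miller": 20}

for name in ("smith", "miller", "unknown"):
    print(name, frequency_weights(name, surnames))
```

With these made-up counts the rarer value ("miller") receives a higher agreement weight than the common one ("smith"), which is the intended effect of the frequency dependent calculation.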

If a value is not found in the frequency table, the M- and U-probabilities are used instead to compute generic agreement and disagreement weights as described in Section 9.2.
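
The fallback can be illustrated as follows. The sketch assumes that the generic weights of Section 9.2 take the usual Fellegi and Sunter form, log2(M/U) for agreement and log2((1-M)/(1-U)) for disagreement; Section 9.2 remains the authoritative definition. The function names and the convention of passing `None` for a value that was not found in the frequency table are hypothetical.

```python
import math


def generic_weights(m_prob, u_prob):
    """Generic weights from M- and U-probabilities (assumed form)."""
    agree_weight = math.log2(m_prob / u_prob)
    disagree_weight = math.log2((1.0 - m_prob) / (1.0 - u_prob))
    return agree_weight, disagree_weight


def weights_with_fallback(freq_weights, m_prob, u_prob):
    """Use frequency based weights if available, else generic ones.

    freq_weights is an (agree, disagree) tuple, or None if the value
    was not found in the frequency table (hypothetical convention).
    """
    if freq_weights is not None:
        return freq_weights
    return generic_weights(m_prob, u_prob)


# Value found in the table: its frequency based weights are kept.
print(weights_with_fallback((1.0, -0.585), 0.9, 0.1))

# Value not found: generic weights from M = 0.9 and U = 0.1 are used.
print(weights_with_fallback(None, 0.9, 0.1))
```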

Each of the two field values (one from each record) now has its own agreement and disagreement weight. If the two values are the same, the minimum of the two agreement weights is selected; if they differ, the maximum of the two disagreement weights is selected. Partial agreement weights are then calculated as described in the sections below.
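
The selection rule for the exact agreement and disagreement cases can be sketched as below. The function name, the tuple layout `(agree_weight, disagree_weight)`, and the example numbers are assumptions for illustration; partial agreement is deliberately left out, since it depends on the individual comparison function.

```python
def select_weight(value_a, weights_a, value_b, weights_b):
    """Pick the weight for an exact comparison of two field values.

    weights_a and weights_b are (agree_weight, disagree_weight) tuples,
    one for each record's value.
    """
    if value_a == value_b:
        # Agreement: take the smaller of the two agreement weights.
        return min(weights_a[0], weights_b[0])
    # Disagreement: take the larger of the two disagreement weights.
    return max(weights_a[1], weights_b[1])


# Illustrative weights only (rounded, not taken from real data).
smith_weights = (1.0, -0.585)
miller_weights = (2.322, -0.263)

print(select_weight("smith", smith_weights, "smith", smith_weights))
print(select_weight("smith", smith_weights, "miller", miller_weights))
```

In the second call the values differ, so the larger of the two disagreement weights is returned, which is the one belonging to "miller".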

Note: The choice of the minimum for the agreement weight and the maximum for the disagreement weight is not based on sound theory. Rather, the authors did not find enough evidence in the record linkage literature on how to calculate the weights when two values partially agree and have different agreement and disagreement weights.

The following sections contain descriptions of the field comparison functions currently provided by Febrl. Improved and additional functions will be added in later versions of this software.