9.2.1 Frequency Dependent Weight Calculation

If a frequency table is given for a field comparator that supports frequency dependent weight calculation, the agreement weight is calculated from the frequencies of the compared field values, provided these values are found in the frequency table.

If a field value is found in the given frequency table, its count and the sum of all entries in the frequency table are used to compute the frequency probability of this entry.

value\_freq\_prob = \frac{count[value]}{\sum_{i} count[i]}

The agreement weight is then calculated using the following formula.

agree\_weight = \log_2 \left( \frac{1.0}{value\_freq\_prob} \right)
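The two formulas above can be sketched in a few lines of Python. This is only an illustration, not Febrl's actual code: the function name freq_agree_weight, the dictionary representation of the frequency table and the example counts are all assumptions made for this sketch.

```python
import math


def freq_agree_weight(value, freq_table):
    """Return log2(1 / value_freq_prob), or None if the value is not in the table.

    freq_table is assumed to be a dict mapping field values to counts
    (hypothetical representation, chosen for this sketch only).
    """
    count = freq_table.get(value, 0)
    if count <= 0:
        return None  # Value not found: no frequency dependent weight

    total = sum(freq_table.values())  # Sum of all entries in the table
    value_freq_prob = count / total
    return math.log2(1.0 / value_freq_prob)


# Made-up surname counts, used only to illustrate the calculation
surname_counts = {"smith": 4, "miller": 3, "dijkstra": 1}

print(freq_agree_weight("smith", surname_counts))     # 1.0 (probability 4/8)
print(freq_agree_weight("dijkstra", surname_counts))  # 3.0 (probability 1/8)
```

Rare values receive a larger agreement weight than common ones, because agreement on a rare value is less likely to happen by chance.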

Note that a frequency dependent weight is only calculated if the values agree exactly (i.e. when they are the same); otherwise, the generic disagreement weight is used.

If the attribute freq_table_max_weight (see Section 9.2 above) is set and the calculated agreement weight is larger than this value, it is limited to the value of freq_table_max_weight.

If a value is not found in a frequency table, the M- and U-probabilities are used to compute generic agreement and disagreement weights as described in Section 9.2.
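The limit and the fallback described above might be combined as in the following sketch. It is not taken from the Febrl sources. The function name agreement_weight and its parameters are hypothetical, and the generic agreement weight is assumed to be log2(m/u), the usual Fellegi and Sunter form, since the exact formula is given in Section 9.2 and not repeated here.

```python
import math


def agreement_weight(value, freq_table, m_prob, u_prob, max_weight=None):
    """Agreement weight for a value on which both records agree exactly.

    freq_table: dict mapping values to counts (assumed representation).
    max_weight: plays the role of freq_table_max_weight; None means no limit.
    """
    count = freq_table.get(value, 0)
    if count > 0:
        total = sum(freq_table.values())
        weight = math.log2(total / count)  # Same as log2(1 / value_freq_prob)
        # Limit the frequency dependent weight if a maximum is set
        if max_weight is not None and weight > max_weight:
            weight = max_weight
        return weight

    # Value not in the table: generic agreement weight.
    # ASSUMPTION: log2(m/u); see Section 9.2 for the definition Febrl uses.
    return math.log2(m_prob / u_prob)


# Made-up counts and probabilities, for illustration only
surname_counts = {"smith": 4, "miller": 3, "dijkstra": 1}

print(agreement_weight("dijkstra", surname_counts, 0.5, 0.125, max_weight=2.5))  # 2.5
print(agreement_weight("smith", surname_counts, 0.5, 0.125, max_weight=2.5))     # 1.0
print(agreement_weight("nguyen", surname_counts, 0.5, 0.125, max_weight=2.5))    # 2.0
```

In this sketch the limit is applied only to the frequency dependent weight, not to the generic fallback weight; whether Febrl also limits the generic weight is not stated in this section.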

If the field values differ, agreement and disagreement weights are still calculated (and then used to calculate partial agreement weights as described in the following sections). While the disagreement weight is never calculated using frequency tables, frequency dependent agreement weights will be calculated if a frequency table is available and the values are found in this table.

Different input values might have different frequencies, resulting in different agreement weights as shown in the above formula. In this case the minimum of the two frequency dependent agreement weights is selected and used, together with the generic disagreement weight, to calculate partial agreement weights as described in the sections below.
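The selection of weights for two differing values can be sketched as follows. This is a hedged illustration rather than Febrl's implementation. The function name partial_agreement_inputs is hypothetical, and the generic weights are assumed to be log2(m/u) and log2((1-m)/(1-u)). The sketch further assumes that a value missing from the table falls back to the generic agreement weight, which this section does not spell out for the case where only one of the two values is found.

```python
import math


def partial_agreement_inputs(value_a, value_b, freq_table, m_prob, u_prob):
    """Return (agree_weight, disagree_weight) for two differing field values.

    freq_table: dict mapping values to counts (assumed representation).
    """
    total = sum(freq_table.values())

    # ASSUMPTION: generic weights in the usual Fellegi and Sunter form
    generic_agree = math.log2(m_prob / u_prob)
    generic_disagree = math.log2((1.0 - m_prob) / (1.0 - u_prob))

    def agree(value):
        count = freq_table.get(value, 0)
        if count > 0:
            return math.log2(total / count)  # Frequency dependent weight
        return generic_agree  # ASSUMPTION: per-value fallback if not found

    # Use the smaller (more conservative) of the two agreement weights;
    # the disagreement weight never depends on the frequency table.
    return min(agree(value_a), agree(value_b)), generic_disagree


# Made-up counts, for illustration only
surname_counts = {"smith": 4, "miller": 3, "smyth": 1}

agree_w, disagree_w = partial_agreement_inputs(
    "smith", "smyth", surname_counts, 0.9, 0.1
)
print(agree_w)  # 1.0, the weight of the more frequent value "smith"
```

Taking the minimum means the more frequent of the two values determines the agreement weight, so a rare misspelling cannot inflate the weight of a partial match.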

Note: The choice of the minimum agreement weight is not based on sound theory. Rather, the authors did not find enough evidence in the record linkage literature on how to calculate weights when two values agree only partially and have different agreement weights.

The following sections contain descriptions of the field comparison functions currently provided by Febrl. Improved and additional functions will be added in later versions of this software.