9.2.4 Approximate String Comparison  'FieldComparatorApproxString'

Approximate string comparison is important for calculating meaningful weights when comparing strings such as names and addresses. Instead of returning only an agreement or a disagreement weight, approximate string comparators allow for partial agreement when two strings are not identical but nearly the same, for example because of typographical or other errors.

Various algorithms for approximate string comparison have been developed, both in the statistical record linkage community [30] and in the computer science and natural language processing communities. In the Febrl system, several approximate string comparison algorithms are implemented in the module stringcmp.py. All string comparison functions in this module return a value between $0.0$ (the two strings are completely different) and $1.0$ (the two strings are the same).
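
As a rough illustration of such a return value, the following sketch computes a similarity between $0.0$ and $1.0$ for two strings. It is not one of the algorithms from stringcmp.py: the function name similarity is hypothetical, and it uses SequenceMatcher from the Python standard library module difflib as a stand-in.

```python
from difflib import SequenceMatcher


def similarity(str1, str2):
    """Return a similarity value between 0.0 and 1.0 for two strings.

    Illustrative stand-in only, not a Febrl comparison function.
    """
    if str1 == str2:
        # Identical strings always count as a full match
        return 1.0
    # ratio() is 2 * (matching characters) / (total characters in both strings)
    return SequenceMatcher(None, str1, str2).ratio()
```

Identical strings give $1.0$, strings without any common characters give $0.0$, and near matches such as 'peter' and 'pete' fall in between.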

The approximate string comparison method has to be selected with the compare_method argument. The following methods are currently implemented:

A second argument that has to be given to the approximate string comparator is min_approx_value (a number between $0.0$ and $1.0$), which is the lowest similarity value that is still tolerated as a partial agreement.

If the two strings are the same (i.e. if the similarity measure returned by the approximate string comparator is $1.0$), the agreement weight is returned. If the value is less than $1.0$ but greater than or equal to min_approx_value, a partial agreement weight is calculated using the following formula:

\begin{eqnarray*}
partial\_agreement & = & agree\_weight - \left(
\frac{1.0 - similarity\_measure}{1.0 - min\_approx\_value} \right) * \\
~& ~& (agree\_weight + abs(disagree\_weight))
\end{eqnarray*}


If the returned value is smaller than min_approx_value, the disagreement weight is returned.
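
The three cases above can be sketched in Python as follows. This is a minimal sketch and not the Febrl implementation: the function name approx_string_weight and its parameter names are hypothetical, and it assumes that the partial agreement weight moves linearly from the agreement weight (at a similarity of $1.0$) down to the negative absolute value of the disagreement weight (at min_approx_value).

```python
def approx_string_weight(similarity, agree_weight, disagree_weight,
                         min_approx_value):
    """Map a similarity value (0.0 to 1.0) to a matching weight.

    Hypothetical helper for illustration, not part of Febrl.
    """
    if not 0.0 <= min_approx_value <= 1.0:
        raise ValueError("min_approx_value must be between 0.0 and 1.0")

    if similarity >= 1.0:
        # The two strings are the same: full agreement
        return agree_weight

    if similarity < min_approx_value:
        # Similarity below the tolerated minimum: disagreement
        return disagree_weight

    # Partial agreement: fraction is 0.0 at similarity 1.0 and
    # 1.0 at similarity min_approx_value
    fraction = (1.0 - similarity) / (1.0 - min_approx_value)
    return agree_weight - fraction * (agree_weight + abs(disagree_weight))
```

For example, with an agreement weight of $4.0$, a disagreement weight of $-2.0$ and min_approx_value set to $0.6$, a similarity of $0.8$ lies halfway between the two thresholds and gives a partial agreement weight of about $1.0$, while a similarity of $0.5$ gives the disagreement weight of $-2.0$.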

If a frequency table is given for this field comparator, the agreement weight will be calculated using the frequency of the value in the input fields that are compared, as described in Section 9.2.1. Frequency dependent weights will also be used to calculate a partial agreement weight.