9.2.4 Approximate String Comparison 'FieldComparatorApproxString'

Approximate string comparison is an important feature for successful weight calculation when comparing strings from names and addresses. Instead of simply having an agreement or disagreement weight returned, approximate string comparators allow for partial agreement if strings are not exactly but almost the same, which can be due to typographical and other errors.

Various algorithms for approximate string comparison have been developed, both in the statistical record linkage community [21] and in the computer science and natural language processing communities. In the Febrl system, several approximate string comparison algorithms are implemented in the module stringcmp.py. All string comparison functions implemented in this module return a value between 0.0 (two strings are completely different) and 1.0 (two strings are the same).

The approximate string comparison method has to be selected with the compare_method argument. The following methods are currently implemented:

jaro
The Jaro [21,27] string comparator is commonly used in record linkage software. It uses the number of common characters in two strings, the lengths of both strings, and the number of transpositions to compute a similarity measure between 0.0 and 1.0.
winkler
The Winkler comparator is based on the Jaro comparator but takes into account the fact that typographical errors occur more often towards the end of words, and thus gives an increased value to characters in agreement at the beginning of the strings. The partial agreement weight is therefore increased if the beginning of two strings is the same.
bigram
Bigrams are the two-character substrings in a string, for example 'peter' contains the bigrams 'pe', 'et', 'te' and 'er'. In the Bigram string comparator, the number of common bigrams in the two strings is counted and divided by the average number of bigrams in the two strings, to calculate a similarity measure between 0.0 and 1.0.
editdist
The Edit distance algorithm (also known as the Levenshtein distance) counts the minimum number of deletions, substitutions and insertions that have to be made to transform one string into the other. This number is then divided by the length of the longer string, and the result is subtracted from 1.0, to get a similarity measure between 0.0 and 1.0.
seqmatch
This approximate string comparator is implemented in the Python standard library difflib. It is based on an algorithm developed by Ratcliff and Obershelp in the 1980s, and uses pattern matching to compute a similarity measure between 0.0 and 1.0.
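
As an illustration of the descriptions above, the following is a minimal Python sketch of three of the comparators (bigram, editdist and seqmatch). It is not the code from stringcmp.py: the function names are hypothetical, the edit distance is the standard Levenshtein distance (insertions, deletions and substitutions), and its similarity is assumed to be 1.0 minus the distance divided by the length of the longer string.

```python
from collections import Counter
from difflib import SequenceMatcher


def bigrams(s):
    """Return the list of two-character substrings of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_similarity(a, b):
    """Common bigrams divided by the average number of bigrams (hypothetical helper)."""
    if a == b:
        return 1.0
    bi_a, bi_b = bigrams(a), bigrams(b)
    if not bi_a or not bi_b:
        return 0.0  # a string shorter than two characters has no bigrams
    # Multiset intersection counts each shared bigram at most as often
    # as it occurs in both strings.
    common = sum((Counter(bi_a) & Counter(bi_b)).values())
    average = (len(bi_a) + len(bi_b)) / 2.0
    return common / average


def edit_distance(a, b):
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def editdist_similarity(a, b):
    """Assumed normalisation: 1.0 - distance / length of the longer string."""
    if a == b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def seqmatch_similarity(a, b):
    """Similarity ratio from difflib.SequenceMatcher in the standard library."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    for first, second in [("peter", "pete"), ("kitten", "sitting")]:
        print(first, second,
              round(bigram_similarity(first, second), 3),
              round(editdist_similarity(first, second), 3),
              round(seqmatch_similarity(first, second), 3))
```

All three functions return 1.0 for identical strings and smaller values as the strings diverge, so any of them could supply the similarity measure that is compared against min_approx_value.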

A second argument that needs to be given to the approximate string comparator is min_approx_value (a number between 0.0 and 1.0), which is the minimal approximate string similarity measure tolerated.

If the two strings are the same (i.e. if the similarity measure returned by the approximate string comparator is 1.0), the agreement weight is returned. If the value is less than 1.0 but greater than or equal to min_approx_value, then a partial agreement weight is calculated using the following formula.

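$\begin{eqnarray*} partial\_agreement & = & agree\_weight - \left( \frac{1.0 - similarity}{1.0 - min\_approx\_value} \right) * \\ ~& ~& (agree\_weight + abs(disagree\_weight)) \end{eqnarray*}$

Here similarity denotes the value returned by the approximate string comparator.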

If the returned value is smaller than min_approx_value, the disagreement weight is returned.
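
The three cases above can be sketched in Python as follows. This is a minimal illustration, not the Febrl implementation: the function name and its argument names other than min_approx_value are hypothetical, and the partial agreement branch assumes the weight is reduced in proportion to (1.0 - similarity) / (1.0 - min_approx_value).

```python
def approx_string_weight(similarity, min_approx_value,
                         agree_weight, disagree_weight):
    """Map a similarity in [0.0, 1.0] to a matching weight (hypothetical helper)."""
    if similarity == 1.0:
        # The two strings are the same: full agreement weight.
        return agree_weight
    if similarity >= min_approx_value:
        # Partial agreement: scale is 0.0 near similarity 1.0 and
        # 1.0 when similarity equals min_approx_value.
        scale = (1.0 - similarity) / (1.0 - min_approx_value)
        return agree_weight - scale * (agree_weight + abs(disagree_weight))
    # Similarity below the tolerated minimum: disagreement weight.
    return disagree_weight


if __name__ == "__main__":
    # Example weights are made up for illustration only.
    for sim in (1.0, 0.8, 0.6, 0.5):
        print(sim, approx_string_weight(sim, 0.6, 4.0, -2.0))
```

With a negative disagreement weight, the partial agreement weight falls from the agreement weight at a similarity of 1.0 to the disagreement weight at a similarity equal to min_approx_value, so there is no jump at the threshold.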

If a frequency table is given for this field comparator, both agreement and disagreement weights will be calculated using the frequencies of the values in the input fields that are compared, as described in Section 9.2.1.