Approximate string comparison is an important feature for successful weight calculation when comparing strings from names and addresses. Instead of simply having an agreement or disagreement weight returned, approximate string comparators allow for partial agreement if strings are not exactly but almost the same, which can be due to typographical and other errors.

Various algorithms for approximate string comparisons have been
developed, in both the statistical record linkage [30] and in
the computer science and natural language processing communities. In
the **Febrl** system, several approximate string comparison
algorithms are implemented in the module
`stringcmp.py`. All string comparison functions implemented in
this module return a value between (two strings are completely
different) and (two strings are the same).

The approximate string comparison method has to be selected with the
`compare_method`

argument. The following methods are currently
implemented:

`jaro`

The*Jaro*[30,37] string comparator is commonly used in record linkage software. It computes the number of common characters in two strings, the lengths of both strings, and the number of transpositions to compute a similarity measure between and .`winkler`

The*Winkler*comparator is based on the*Jaro*comparator but takes into account the fact that typographical errors occur more often towards the end of words, and thus gives an increased value to characters in agreement at the beginning of the strings. The partial agreement weight is therefore increased if the beginning of two strings is the same.`bigram`

Bigrams are the two-character substrings in a string, for example`'peter'`

contains the bigrams`'pe'`

,`'et'`

,`'te'`

and`'er'`

. In the*Bigram*string comparator, the number of common bigrams in the two strings is counted and divided by the average number of bigrams in the two strings, to calculate a similarity measure between 0.0 and 1.0.`editdist`

The edit distance (also known as*Levenshtein distance*) algorithm counts the minimum number of character deletions, transpositions and insertions that have to be made to transform one string into the other. This number is then divided by the length of the longer string to get a similarity measure between 0.0 and 1.0.`bagdist`

The bag distance is a cheap distance measure [2] which always returns a distance smaller or equal to the edit distance. For two given strings and this distance can be calculated in , while the edit distance takes to calculate. The bag distance is then divided by the length of the longer string to get a similarity measure between 0.0 and 1.0. Due to this transformation the bag distance similarity measure will always be larger than the edit distance similarity measure.`seqmatch`

This approximate string comparator is implemented in the Python standard library`difflib`. It is based on an algorithm developed by*Ratcliff*and*Obershelp*in the 1980s, and uses pattern matching to compute a similarity measure between 0.0 and 1.0.`compression`

Using compression algorithms to calculate similarity between strings has recently been proposed [9,18] and used for various data mining algorithms, including clustering and time series analysis. A similarity measure between two strings and is calculated as follows.

with being the compression function available in the**Python**module`zlib`.`sortwinkler`

If a string contains more than one word (i.e. it contains at least one whitespace), then the words are first sorted before a standard*Winkler*comparison is applied. The idea is that - unless there are errors in the first few letters of a word - sorting of swapped words will bring them into the same order, thereby improving the similarity value.`permwinkler`

Based on the experience that swapped words result in low approximate similarity measure for most string comparators, the*permutation Winkler*creates all possible permutations of words (if there are more than one word in an input string, and then calculates the similarity measures - using the standard*Winkler*comparator discussed above - for all permutations. The maximum calculated similarity measure is returned.

A second argument that needs to be given to the approximate string
comparator is `min_approx_value`

(a number between and
), which is the minimal approximate string similarity measure
tolerated.

If the two strings are the same (i.e. if the similarity measure
returned by the approximate string comparator is ), the agreement
weight is returned. If the value is less than but larger or
equal to `min_approx_value`

, then a partial agreement weight is
calculated using the following formula.

If the returned value is smaller than `min_approx_value`

the
disagreement weight will be returned.

If a frequency table is given for this field comparator, the agreement weight will be calculated using the frequency of the value in the input fields that are compared, as described in Section 9.2.1. Frequency dependent weights will also be used to calculate a partial agreement weight.