Approximate string comparison is an important feature for successful weight calculation when comparing strings from names and addresses. Instead of simply returning an agreement or disagreement weight, approximate string comparators allow for partial agreement when two strings are not exactly the same but almost the same, for example because of typographical or other errors.
Various algorithms for approximate string comparison have been developed, both in the statistical record linkage community [21] and in the computer science and natural language processing communities. In the Febrl system, several approximate string comparison algorithms are implemented in the module stringcmp.py. All string comparison functions implemented in this module return a value between 0.0 (two strings are completely different) and 1.0 (two strings are the same).
The approximate string comparison method has to be selected with the compare_method argument. The following methods are currently implemented:
jaro
winkler
bigram
A bigram is a two-character sub-string of a string. For example, 'peter' contains the bigrams 'pe', 'et', 'te' and 'er'. In the bigram string comparator, the number of common bigrams in the two strings is counted and divided by the average number of bigrams in the two strings, to calculate a similarity measure between 0.0 and 1.0.
editdist
seqmatch
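The bigram calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in stringcmp.py: the function names bigrams and bigram_similarity are hypothetical, repeated bigrams are matched one-to-one, and a string too short to contain any bigram is assumed to be similar only to itself.

```python
from collections import Counter


def bigrams(s):
    """Return the overlapping two-character sub-strings of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_similarity(str1, str2):
    """Hypothetical helper: common bigrams divided by the average bigram count."""
    if str1 == str2:
        return 1.0
    bg1 = Counter(bigrams(str1))
    bg2 = Counter(bigrams(str2))
    total1 = sum(bg1.values())
    total2 = sum(bg2.values())
    if total1 == 0 or total2 == 0:
        # Assumption: a string without bigrams matches nothing but itself.
        return 0.0
    # Multiset intersection: each repeated bigram is matched at most
    # as often as it occurs in both strings.
    common = sum((bg1 & bg2).values())
    average = (total1 + total2) / 2.0
    return common / average


print(bigrams("peter"))
print(bigram_similarity("peter", "pete"))
```

Here 'peter' has four bigrams and 'pete' has three, all three of which are shared, so the similarity is 3 / 3.5, or roughly 0.857.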
A second argument that needs to be given to the approximate string comparator is min_approx_value (a number between 0.0 and 1.0), which is the minimal approximate string similarity measure tolerated.
If the two strings are the same (i.e. if the similarity measure returned by the approximate string comparator is 1.0), the agreement weight is returned. If the value is less than 1.0 but larger than or equal to min_approx_value, then a partial agreement weight is calculated using the following formula.
If the returned value is smaller than min_approx_value, the disagreement weight is returned.
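As a rough illustration of the three cases, the sketch below assumes that the partial agreement weight falls linearly from the agreement weight (at a similarity of 1.0) to the disagreement weight (at min_approx_value), and that the disagreement weight is negative. Both are assumptions made for this example, and the function name field_weight and its parameter names other than min_approx_value are hypothetical rather than part of Febrl.

```python
def field_weight(similarity, min_approx_value, agree_weight, disagree_weight):
    """Hypothetical helper mapping a similarity value to a matching weight.

    Assumes linear interpolation between the two weights and a
    negative disagree_weight.
    """
    if similarity >= 1.0:
        # Strings are the same: full agreement.
        return agree_weight
    if similarity < min_approx_value:
        # Similarity below the tolerated minimum: full disagreement.
        return disagree_weight
    # Fraction of the way from full agreement (0.0) down to the
    # tolerated minimum (1.0). The division is safe because
    # min_approx_value <= similarity < 1.0 at this point.
    fraction = (1.0 - similarity) / (1.0 - min_approx_value)
    return agree_weight - fraction * (agree_weight + abs(disagree_weight))


for sim in (1.0, 0.75, 0.5, 0.3):
    print(sim, field_weight(sim, 0.5, 5.0, -5.0))
```

With an agreement weight of 5.0, a disagreement weight of -5.0 and min_approx_value set to 0.5, a similarity of 0.75 lies halfway between the two thresholds and therefore yields a weight of 0.0 under this assumed interpolation.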
If a frequency table is given for this field comparator, both agreement and disagreement weights will be calculated using the frequencies of the values in the input fields that are compared, as described in Section 9.2.1.