Phonetic name encoding is traditionally used to create blocking variables in the record linkage process, but it can also be used to compare strings. Several algorithms for phonetic encoding are implemented in the encode.py module.
The encoded string comparison function compares the two fields (or field lists) given to it as encoded strings. It returns the agreement weight if both strings have the same encoding, and the disagreement weight otherwise.
The encoding method has to be selected with the encode_method argument. The following methods are currently implemented in Febrl:
soundex
mod_soundex
nysiis
phonex
dmetaphone
All of these encodings are particularly sensitive to errors in the first letter of a string. Therefore, an additional argument to the encoded string comparator is reverse, which can be set to either False or True. In the latter case, the strings are reversed before they are encoded. The default value for reverse is False.
The maximum length of the codes calculated can be set with the argument max_code_length, which has a default value of 4.
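The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not Febrl's implementation and does not use the encode.py module: the soundex function below is a simplified Soundex variant written for this example, and the names compare_encoded, agree_weight and disagree_weight, as well as the weight values 4.0 and -2.0, are hypothetical. Only reverse and max_code_length (and their defaults) are taken from the text.

```python
# Simplified Soundex letter groups (illustrative; Febrl's own soundex may differ).
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(s, max_code_length=4):
    """Return a Soundex-style code: first letter plus digits, zero-padded."""
    s = "".join(ch for ch in s.lower() if ch.isalpha())
    if not s:
        return ""
    code = s[0]
    prev = SOUNDEX_CODES.get(s[0], "")
    for ch in s[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not separate equal digits
            prev = digit
    # Pad with zeros, then truncate to the requested code length.
    return (code + "0" * max_code_length)[:max_code_length]


def compare_encoded(a, b, agree_weight, disagree_weight,
                    encode_method=soundex, reverse=False, max_code_length=4):
    """Hypothetical comparator: agreement weight if both codes are equal."""
    if reverse:
        # Reverse first, so the code is anchored on the last letter instead.
        a, b = a[::-1], b[::-1]
    code_a = encode_method(a, max_code_length)
    code_b = encode_method(b, max_code_length)
    # Assumption of this sketch: empty values never count as agreement.
    if code_a and code_a == code_b:
        return agree_weight
    return disagree_weight


print(soundex("christina"), soundex("kristina"))                     # c623 k623
print(compare_encoded("christina", "kristina", 4.0, -2.0))           # -2.0
print(compare_encoded("christina", "kristina", 4.0, -2.0, reverse=True))  # 4.0
```

In this sketch the two spellings differ only in their first letter, so their codes disagree when encoded normally but agree once the strings are reversed, which is the case the reverse argument is meant to address.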
If a frequency table is given for this field comparator, the agreement weight will be calculated using the frequency of the value in the input fields that are compared, as described in Section 9.2.1. Frequency-dependent weights will also be used to calculate a partial agreement weight.