The input to the data cleaning routine is a string that contains an
input component, i.e. either a name or an address.
First, all letters in such a string are converted into lower case.
Then, a correction list of replacement strings is used to
replace certain words, abbreviations and characters with others. For
example, given the example correction list in Table 6.1, variations of
'known as', such as 'a.k.a.' or 'aka', are all replaced with the
standard string 'known as'. A correction list is loaded
from a correction list file (see
Section 14.1 for the details of the formats
of such files). Each entry in such a list consists of an original string
(one or more words, or a single character) and a corresponding
replacement string. For each entry in the list, the input string is
scanned, and every occurrence of the original string is replaced by the
corresponding replacement string.
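
The scan-and-replace step described so far can be sketched roughly as follows. This is only an illustration: the function name `apply_corrections` and the in-memory list of (original, replacement) pairs are assumptions made for this example, not the routine's actual interface or file format.

```python
def apply_corrections(text, corrections):
    """Lower-case `text`, then apply each (original, replacement) pair in turn.

    `corrections` is a hypothetical in-memory form of a correction list.
    """
    cleaned = text.lower()
    for original, replacement in corrections:
        # str.replace substitutes every occurrence of `original`.
        cleaned = cleaned.replace(original, replacement)
    return cleaned


# Entries taken from the 'known as' example in the text.
corrections = [("a.k.a.", "known as"), ("aka", "known as")]

print(apply_corrections("Peter Miller A.K.A. Pete", corrections))
# prints: peter miller known as pete
```
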
Each correction list is sorted and processed by decreasing length of
the original string, i.e. long original strings are searched for and
replaced first. In the example correction list given below, the entry
' knownas ' would be searched for first and, if found, would be
replaced by ' known as '. Note the spaces around some of the
entries. They are important, especially for short words like ' na '
(not available). If the entry were only 'na', each occurrence of 'na'
in the input would be replaced by a single space ' '. The name
'bernadette' would thus be converted into 'ber dette'.
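
The ordering rule and the effect of the surrounding spaces can be illustrated with a minimal sketch. The function name `clean_string` and the two small lists below are assumptions made for this example; only the entries ' knownas ', ' na ' and 'na' come from the text.

```python
def clean_string(text, corrections):
    """Apply (original, replacement) pairs, longest original string first."""
    # Sort by decreasing length of the original string.
    ordered = sorted(corrections, key=lambda pair: len(pair[0]), reverse=True)
    cleaned = text.lower()
    for original, replacement in ordered:
        cleaned = cleaned.replace(original, replacement)
    return cleaned


# Hypothetical lists: one with the spaces kept, one with them dropped.
with_spaces = [(" na ", " "), (" knownas ", " known as ")]
without_spaces = [("na", " ")]

print(clean_string("smith knownas jones", with_spaces))  # smith known as jones
print(clean_string("bernadette", with_spaces))           # bernadette
print(clean_string("bernadette", without_spaces))        # ber dette
```

With the spaces in place, 'na' inside 'bernadette' is left alone because it is not surrounded by spaces in the input.
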
The output of the data cleaning routine is a new string where all occurrences of substrings found in the correction list have been replaced with the corresponding replacement strings. Note that the length of the output string might be different from the input string.