Correction list files contain characters or strings and their
corresponding corrections (replacements). They are converted into
Python lists which are used in the initial data cleaning step to
replace a character or string with the corresponding replacement.
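Correction list files should have the file extension '.lst'. The
format of these files is as follows: each entry consists of a quoted
replacement string, followed by ':=' and a comma separated list of
one or more quoted values. An entry can extend over several lines.
Comment lines start with a hash '#' character.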
# ====================================================================
# Remove characters and words from input (replace with single space)

' ' := '.', '?', '~', '_', ':', ';', '^', '=',
       ' na ', ' n/a ', ' n.a.', '\', ' also ', ' name ',
       ' only ', ' abbrev ', ' initials ', ' unk ', ' unkn ',
       ' missing ', ' unknown '

# Correct words and symbols

' and '      := '+', '&'
' baby '     := ' babe '
' baby of '  := ' babyof ', ' babeof ', ' b/o ', ' b.o.'
' known as ' := ' knownas ', ' a.k.a. '

# Remove ' from o'brian etc

' o' := " o'"
' a' := " a'"
In the above example, all values in the first entry (an entry that
goes over four lines) are replaced with a single space ' '. It is
important that, for example, the value ' na ' starts with a space and
ends with a space. If these spaces were omitted, i.e. if the value
were 'na', then each occurrence of the string 'na' in the input would
be replaced by a single space. The word 'annabella' would thus be
replaced with 'an bella', which is not what is wanted.
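The effect of the surrounding spaces can be reproduced with plain Python string replacement. The following is a minimal sketch and not the code used in lookup.py; padding the input with a leading and trailing space is an assumption made here so that a stand-alone token at either end of the string can still match.

```python
# Illustrative only: plain substring replacement, as described in the text.
word = 'annabella'

# Value without surrounding spaces: it also matches inside other words.
unpadded = word.replace('na', ' ')

# Value with surrounding spaces: it only matches a stand-alone token.
# The input is padded with spaces (an assumption of this sketch) so that
# a token at the start or end of the string can be matched as well.
padded_input = ' ' + 'annabella na' + ' '
padded = padded_input.replace(' na ', ' ')

print(repr(unpadded))  # 'an bella'
print(repr(padded))    # ' annabella '
```

With the padded value, the name 'annabella' is left intact and only the separate token 'na' is removed.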
The list of value and replacement pairs is internally sorted by
decreasing length of the values. Long value strings are therefore
replaced before shorter strings or characters. In this way, the value
' a.k.a. ' is replaced with ' known as ' before full stops (periods)
'.' are replaced by a space ' '.
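The ordering described above can be sketched in a few lines of Python. The function name apply_corrections and its longest_first parameter are hypothetical and only serve this illustration; the actual implementation in lookup.py may differ.

```python
def apply_corrections(text, pairs, longest_first=True):
    """Apply (value, replacement) pairs to text by substring replacement.

    With longest_first=True the pairs are sorted by decreasing length of
    the value, so that long values are replaced before short ones.
    """
    if longest_first:
        pairs = sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)
    for value, replacement in pairs:
        text = text.replace(value, replacement)
    return text


# Two pairs taken from the example correction list above
pairs = [('.', ' '), (' a.k.a. ', ' known as ')]
text = ' peter a.k.a. pete '

sorted_result = apply_corrections(text, pairs)
unsorted_result = apply_corrections(text, pairs, longest_first=False)

print(repr(sorted_result))    # ' peter known as pete '
print(repr(unsorted_result))  # ' peter a k a  pete '
```

Without the sorting, the full stops are replaced first and the value ' a.k.a. ' can no longer be found in the input.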
Assuming the lookup.py module has been imported using the
import lookup command, a correction list can be initialised and
loaded as shown in the following example.
# ====================================================================

address_correction_list = lookup.CorrectionList(name = 'AddrCorrList')
address_correction_list.load('address_corr_list.lst')

print(address_correction_list.length)  # Number of entries in the list
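To make the file format more concrete, the following stand-alone sketch reads correction list lines into a Python list of (value, replacement) pairs. It is not the loader from lookup.py: the names QUOTED, unquote_all and parse_correction_list are hypothetical, and the sketch assumes that no quoted string contains ':=' and that a string never contains its own quote character.

```python
import re

# A quoted string in either single or double quotes
QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")


def unquote_all(text):
    """Return the contents of all quoted strings found in text."""
    return [single or double for single, double in QUOTED.findall(text)]


def parse_correction_list(lines):
    """Convert correction list lines into (value, replacement) pairs."""
    pairs = []
    replacement = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue  # Skip empty lines and comment lines
        if ':=' in line:
            # A new entry starts: the replacement is left of ':='
            left, line = line.split(':=', 1)
            replacement = unquote_all(left)[0]
        # Values on this line (also handles continuation lines)
        for value in unquote_all(line):
            pairs.append((value, replacement))
    return pairs


sample_lines = [
    "# Remove characters (replace with single space)",
    "' ' := '.', '?',",
    "       ' na ', ' n/a '",
    "' and ' := '+', '&'",
    "' o' := \" o'\"",
]

pairs = parse_correction_list(sample_lines)
print(len(pairs))  # 7
```

Each value is stored together with the replacement of the entry it belongs to, so the resulting list can then be sorted by value length as described earlier in this section.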