14.1 Correction List

Correction list files contain characters or strings together with their corrections (replacements). Each file is converted into a Python list, which is used in the initial data cleaning step to replace every listed character or string with its replacement. Correction list files should have the file extension '.lst'. The format of these files is as follows:

# ====================================================================

# Remove characters and words from input (replace with single space)
             ' ' := '.', '?', '~', '_', ':', ';', '^', '=', ' na ',
                    ' n/a ', ' n.a.', '\', ' also ', ' name ',
                    ' only ', ' abbrev ', ' initials ', ' unk ',
                    ' unkn ', ' missing ', ' unknown '

# Correct words and symbols
         ' and ' := '+', '&'
        ' baby ' := ' babe '
     ' baby of ' := ' babyof ', ' babeof ', ' b/o ', ' b.o.'
    ' known as ' := ' knownas ', ' a.k.a. '

# Remove ' from o'brian etc
            ' o' := " o'"
            ' a' := " a'"

In the above example, all values in the first entry (which spans four lines) are replaced with a single space ' '. It is important that a value such as ' na ' starts and ends with a space. If these spaces were omitted, i.e. if the value were 'na', then every occurrence of the string 'na' in the input would be replaced by a single space. The word 'annabella' would thus become 'an bella', which is not what is wanted.
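The effect of the surrounding spaces can be checked with plain Python string replacement. The following is a minimal sketch that uses only the built-in str.replace method, not the lookup.py module itself; the variable names and the sample record are made up for illustration.

```python
# Illustration only: plain str.replace, not the lookup.py implementation.
padded = ' na '      # value as given in the example file
unpadded = 'na'      # the same value with the spaces omitted

name = 'annabella'
# The padded value does not occur inside the word, so the word is kept.
kept = name.replace(padded, ' ')
# The unpadded value matches inside the word and splits it.
broken = name.replace(unpadded, ' ')

# A stand-alone 'na' between two words is still removed by the padded value.
record = 'peter na miller'
cleaned = record.replace(padded, ' ')

print(repr(kept))     # 'annabella'
print(repr(broken))   # 'an bella'
print(repr(cleaned))  # 'peter miller'
```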

The list of value and replacement pairs is internally sorted by decreasing length of the values, so long value strings are replaced before shorter strings or single characters. In this way, the value ' a.k.a. ' is replaced with ' known as ' before full stops (periods) '.' are replaced by a space ' '.
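The reason for this ordering can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the code in lookup.py: the names corrections, sort_by_length and apply_corrections are hypothetical, and the list holds only a few pairs taken from the example file above.

```python
# Hypothetical sketch of longest-first replacement; not the lookup.py code.
corrections = [
    ('.', ' '),                    # (value, replacement) pairs
    (' a.k.a. ', ' known as '),
    (' b/o ', ' baby of '),
    ('&', ' and '),
]

def sort_by_length(pairs):
    """Return the pairs ordered by decreasing length of the value."""
    return sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)

def apply_corrections(text, pairs):
    """Replace each value with its replacement, in the order given."""
    for value, replacement in pairs:
        text = text.replace(value, replacement)
    return text

record = ' smith a.k.a. jones '

# Longest values first: ' a.k.a. ' is still intact when it is looked up.
good = apply_corrections(record, sort_by_length(corrections))

# Unsorted: '.' is replaced first, so ' a.k.a. ' is never found.
bad = apply_corrections(record, corrections)

print(repr(good))            # ' smith known as jones '
print('known as' in bad)     # False
```

If the pairs were applied in file order instead, the result would depend on how the entries happen to be arranged in the file, which is why the sketch sorts before replacing.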

Assuming the lookup.py module has been imported with the statement import lookup, a correction list can be initialised and loaded as shown in the following example.

# ====================================================================

address_correction_list = lookup.CorrectionList(name = 'AddrCorrList')

address_correction_list.load('address_corr_list.lst')

print address_correction_list.length  # Number of entries in the list