6.5 Address Cleaning and Standardisation using a Hidden Markov Model Based Approach

Address cleaning and standardisation is currently implemented using a HMM approach only, as described in more detail in Chapter 7. Addresses are standardised into the seventeen address output fields shown in the second column of Table 6.3. The address component standardiser is implemented in the modules standardisation.py and address.py.

A HMM based address standardiser can be initialised as shown in the code example below. The following arguments need to be given.

name
A name for the address standardiser. Most suitable is a short string.

description
A longer description of the address standardiser. This argument is not mandatory; the standardisation process works fine without a description.

input_fields
A string with a field name, or a list of field names, from the input data set that will be standardised. If a list is given, the field values are concatenated (using the field_separator described below) into one string before parsing and standardisation are done.

output_fields
A list of seventeen field names as defined in the output data set. The address standardiser returns standardised addresses in these seventeen fields, in the sequence given in the second column of Table 6.3. Any of these output fields can be set to None if no output is to be written for it (for example, if one is not interested in the country values), as long as at least one output field is defined (not set to None). Note that the last field holds the probability returned by the address HMM.

address_hmm
A reference to a hidden Markov model for addresses, which must be initialised and loaded. See the example code on page on how to initialise and load a HMM.

address_corr_list
A reference to the address correction list to be used. This list must be initialised and loaded previously. See Chapter 14 and Section 14.1 for more details.

address_tag_table
A reference to the address tagging table to be used. This table must be initialised and loaded previously. See Chapter 14 and Section 14.2 for more details.

field_separator
When more than one input field is given, they are concatenated and a field separator character (which can be an empty string) is inserted between them. The default value for the field separator is a whitespace ' '.

check_word_spill
A flag that can be set to True or False in order to activate or de-activate the word spilling functionality (see Section 6.1.4 for more details). If set to True, word spilling between input fields is checked using the address tag look-up table. Note that word spilling is only useful if the value of the field_separator is not an empty string. The default value for word spilling is True.
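The interplay between input_fields, field_separator and check_word_spill can be illustrated with a small stand-alone sketch. It is not the code from standardisation.py or address.py. The function name join_input_fields, the use of a plain dictionary as the tag look-up table, the skipping of empty field values and the rule "last word of one field plus first word of the next is a known entry" are all assumptions made for illustration only; Section 6.1.4 describes the actual behaviour.

```python
def join_input_fields(record, input_fields, field_separator=' ',
                      check_word_spill=True, tag_table=None):
    """Concatenate input field values into one string (illustrative only).

    record      -- dictionary mapping field names to string values
    tag_table   -- hypothetical stand-in for the address tag look-up table,
                   here simply a dictionary keyed by known words
    """
    if isinstance(input_fields, str):      # a single field name is allowed
        input_fields = [input_fields]
    if tag_table is None:
        tag_table = {}

    result = ''
    previous_value = ''
    for field in input_fields:
        value = record.get(field, '').strip()
        if not value:                      # assumption: skip empty fields
            continue
        if result:
            separator = field_separator
            # Word spilling can only be detected if a separator is inserted.
            if check_word_spill and field_separator != '':
                spilled = previous_value.split()[-1] + value.split()[0]
                if spilled in tag_table:   # e.g. 'syd' + 'ney' = 'sydney'
                    separator = ''         # re-join the broken word
            result += separator
        result += value
        previous_value = value
    return result


# Example: the locality 'north sydney' has spilled into the next field.
example_record = {'locality': 'north syd', 'pcode': 'ney 2060'}
example_tags = {'sydney': 'LN'}            # hypothetical tag table entry
joined = join_input_fields(example_record, ['locality', 'pcode'],
                           tag_table=example_tags)
```

With an empty field_separator the values are simply glued together, so there is no word boundary left to check; this is why word spilling is only useful with a non-empty separator.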

The following example code shows how a HMM based standardiser for addresses can be initialised. Note that the last three output fields (territory, country and the HMM probability) are set to None, so none of these values will be stored in the output data set. Word spilling is activated and the field separator is a whitespace, because neither argument is given and both therefore take their default values.

# ====================================================================

address_hmm_std = AddressHMMStandardiser(name = 'Addr-Standard-HMM',
                                 input_fields = ['wfarenum',
                                                 'wayfare',
                                                 'locality',
                                                 'pcode',
                                                 'state'],
                                 output_fields = ['wayfare_number',
                                                  'wayfare_name',
                                                  'wayfare_qualifier',
                                                  'wayfare_type',
                                                  'unit_number',
                                                  'unit_type',
                                                  'property_name',
                                                  'institution_name',
                                                  'institution_type',
                                                  'postaddr_number',
                                                  'postaddr_type',
                                                  'loc_name',
                                                  'loc_qualifier',
                                                  'postcode',
                                                  None,   # territory (not stored)
                                                  None,   # country (not stored)
                                                  None],  # HMM probability (not stored)
                             address_corr_list = address_corr_list,
                             address_tag_table = address_lookup_table,
                                   address_hmm = myaddress_hmm)
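
The rules for output_fields (exactly seventeen entries, at least one of them not None, and no value written for entries set to None) can be sketched as follows. This is a minimal illustration under stated assumptions, not the standardiser's own code: the names NUM_ADDRESS_FIELDS, check_output_fields and select_output_values are hypothetical, and the standardised values are assumed to arrive as a list in the sequence of Table 6.3.

```python
NUM_ADDRESS_FIELDS = 17  # number of address output fields (Table 6.3)


def check_output_fields(output_fields):
    """Raise ValueError if the output field list breaks the stated rules."""
    if len(output_fields) != NUM_ADDRESS_FIELDS:
        raise ValueError('Expected %d output fields, got %d' %
                         (NUM_ADDRESS_FIELDS, len(output_fields)))
    if all(field is None for field in output_fields):
        raise ValueError('At least one output field must not be None')


def select_output_values(values, output_fields):
    """Map standardised values to output field names, skipping None fields.

    values -- list of seventeen standardised values, in the same sequence
              as output_fields (an assumption made for this sketch)
    """
    check_output_fields(output_fields)
    return {field: value
            for field, value in zip(output_fields, values)
            if field is not None}


# The output field list from the example above: the last three entries
# (territory, country, HMM probability) are None and are therefore dropped.
example_output_fields = ['wayfare_number', 'wayfare_name',
                         'wayfare_qualifier', 'wayfare_type',
                         'unit_number', 'unit_type', 'property_name',
                         'institution_name', 'institution_type',
                         'postaddr_number', 'postaddr_type', 'loc_name',
                         'loc_qualifier', 'postcode', None, None, None]
```

Checking the list once at initialisation time, rather than for every record, keeps configuration errors visible before any data is processed.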