6.5 Address Cleaning and Standardisation using a Hidden Markov Model Based Approach

Address cleaning and standardisation is currently implemented using a HMM approach only, as described in more detail in Chapter 7. Addresses are standardised into the seventeen address output fields shown in the second column of Table 6.3. The address component standardiser is implemented in the modules standardisation.py and address.py.

A HMM based address standardiser can be initialised as shown in the code example below, which also shows the arguments that need to be given.

In this example the last three output fields, for territory, country and the HMM probability, are set to None. No territory or country values are therefore stored in the output data set, and the HMM probability is not stored either. The arguments for word spilling and the field separator are not given, so their default values are used: word spilling is activated and the field separator is a whitespace.

# ====================================================================

# The objects address_corr_list, address_lookup_table and myaddress_hmm
# are assumed to have been defined earlier.

address_hmm_std = AddressHMMStandardiser(name = 'Addr-Standard-HMM',
                                 input_fields = ['wfarenum',
                                                 'wayfare',
                                                 'locality',
                                                 'pcode',
                                                 'state'],
                                output_fields = ['wayfare_number',
                                                 'wayfare_name',
                                                 'wayfare_qualifier',
                                                 'wayfare_type',
                                                 'unit_number',
                                                 'unit_type',
                                                 'property_name',
                                                 'institution_name',
                                                 'institution_type',
                                                 'postaddr_number',
                                                 'postaddr_type',
                                                 'loc_name',
                                                 'loc_qualifier',
                                                 'postcode',
                                                 None,   # territory: not stored
                                                 None,   # country: not stored
                                                 None],  # HMM probability: not stored
                            address_corr_list = address_corr_list,
                            address_tag_table = address_lookup_table,
                                  address_hmm = myaddress_hmm)
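
The effect of the field separator and of setting output fields to None can be illustrated with a small stand-alone sketch. It is not taken from standardisation.py or address.py: the helper functions join_input_fields and store_components, the component names in COMPONENTS and the sample record are all hypothetical, and the assumed component order only follows the description above (territory, country and HMM probability last).

```python
# Illustrative sketch only: these helpers are hypothetical and not part of
# the standardisation.py or address.py modules.

# Assumed order of the seventeen standardised address components. Only the
# position of the last three (territory, country, HMM probability) is taken
# from the text; the names themselves are made up for this example.
COMPONENTS = [
    'wayfare_number', 'wayfare_name', 'wayfare_qualifier', 'wayfare_type',
    'unit_number', 'unit_type', 'property_name', 'institution_name',
    'institution_type', 'postaddr_number', 'postaddr_type',
    'locality_name', 'locality_qualifier', 'postcode',
    'territory', 'country', 'hmm_probability',
]


def join_input_fields(record, input_fields, field_separator=' '):
    """Concatenate the non-empty input field values into one address string."""
    values = [str(record.get(field, '')).strip() for field in input_fields]
    return field_separator.join(value for value in values if value)


def store_components(components, output_fields):
    """Copy component values into a dictionary keyed by output field name.

    Components whose output field is None are skipped, so they do not
    appear in the result at all.
    """
    if len(output_fields) != len(COMPONENTS):
        raise ValueError('expected %d output fields' % len(COMPONENTS))
    stored = {}
    for component, field in zip(COMPONENTS, output_fields):
        if field is not None:
            stored[field] = components.get(component, '')
    return stored


# Hypothetical input record using the input field names from the example.
record = {'wfarenum': '42', 'wayfare': 'miller st',
          'locality': 'dickson', 'pcode': '2602', 'state': 'act'}
address_string = join_input_fields(
    record, ['wfarenum', 'wayfare', 'locality', 'pcode', 'state'])

# Fourteen named output fields followed by three None entries.
output_fields = COMPONENTS[:14] + [None, None, None]
stored = store_components({'wayfare_number': '42', 'territory': 'act'},
                          output_fields)
```

In this sketch the territory value is dropped because its output field is None, while the fourteen named output fields are always present, with an empty string where no value was found.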