Address cleaning and standardisation is currently implemented using an HMM approach only, as described in more detail in Chapter 7. Addresses are standardised into the seventeen address output fields shown in the second column of Table 6.3. The address component standardiser is implemented in the modules standardisation.py and address.py.
An HMM-based address standardiser can be initialised as shown in the code example below. The following arguments need to be given.
If more than one input field is given, their values are concatenated (with the field_separator as described below) into one string before the parsing and standardisation is done.
Each output field can be set to None if no output is to be written (for example if one is not interested in the country values), as long as at least one output field is defined (not set to None). Note the last field, which holds the probability returned from the address HMM.
The word spilling argument can be set to True or False in order to activate or de-activate the word spilling functionality (see Section 6.1.4 for more details). If set to True, word spilling between input fields is checked using the address tag look-up table. Note that word spilling is only useful if the value of field_separator is not an empty string. The default value for word spilling is True.
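The concatenation of input field values and the word spilling check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in standardisation.py: the function concat_input_fields, its argument names word_spill and known_words, and the record dictionary are all hypothetical, and the spilling test shown (joining the two words at a field boundary and looking the result up) is only one plausible reading of the description. A plain set of words stands in for the address tag look-up table.

```python
def concat_input_fields(record, input_fields, field_separator=' ',
                        word_spill=True, known_words=frozenset()):
    """Join the values of several input fields into one string.

    Hypothetical sketch: 'known_words' stands in for the address tag
    look-up table, and 'word_spill' for the word spilling flag.
    """
    # Collect the non-empty, stripped values in input field order
    values = [(record.get(field) or '').strip() for field in input_fields]
    values = [value for value in values if value]
    if not values:
        return ''

    result = values[0]
    for value in values[1:]:
        spilled = False
        # Word spilling is only checked if the separator is not empty
        if word_spill and field_separator != '':
            last_word = result.split()[-1]
            first_word = value.split()[0]
            # Assumed check: do the two boundary words form a known word?
            spilled = (last_word + first_word) in known_words
        if spilled:
            result += value                    # join without separator
        else:
            result += field_separator + value
    return result


record = {'wayfare': '12 main road', 'locality': 'can',
          'pcode': 'berra', 'state': 'act'}
fields = ['wfarenum', 'wayfare', 'locality', 'pcode', 'state']

print(concat_input_fields(record, fields, known_words={'canberra'}))
# prints: 12 main road canberra act
```

With word_spill set to False the same call would keep 'can' and 'berra' apart, which is why the check only matters when a non-empty field separator is inserted between the field values.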
The following example code shows how an HMM-based standardiser for addresses can be initialised. Note that the last three output fields, for territory, country and the HMM probability, are set to None, thus no territory and country values will be stored in the output data set, and the HMM probability is not stored either. Word spilling is activated and the field separator is a single whitespace; both arguments are omitted, so their default values are used.
# ====================================================================

address_hmm_std = AddressHMMStandardiser(
    name = 'Addr-Standard-HMM',
    input_fields = ['wfarenum', 'wayfare', 'locality', 'pcode', 'state'],
    output_fields = ['wayfare_number', 'wayfare_name',
                     'wayfare_qualifier', 'wayfare_type',
                     'unit_number', 'unit_type',
                     'property_name',
                     'institution_name', 'institution_type',
                     'postaddr_number', 'postaddr_type',
                     'loc_name', 'loc_qualifier',
                     'postcode',
                     None, None, None],
    address_corr_list = address_corr_list,
    address_tag_table = address_lookup_table,
    address_hmm = myaddress_hmm)
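The two constraints on the output fields (seventeen entries, at least one of them not None) can be checked before a standardiser is created. The helper below is a hypothetical sketch for illustration only; check_output_fields and NUM_ADDRESS_OUTPUT_FIELDS are assumed names and are not part of the standardiser modules.

```python
# Seventeen output fields, the last one being the HMM probability
NUM_ADDRESS_OUTPUT_FIELDS = 17


def check_output_fields(output_fields):
    """Validate an output field list and return the fields that are set.

    Hypothetical helper: raises ValueError if the list does not have
    seventeen entries or if every entry is None.
    """
    if len(output_fields) != NUM_ADDRESS_OUTPUT_FIELDS:
        raise ValueError('Expected %d output fields, got %d' %
                         (NUM_ADDRESS_OUTPUT_FIELDS, len(output_fields)))
    # Fields set to None are not written to the output data set
    active = [field for field in output_fields if field is not None]
    if not active:
        raise ValueError('At least one output field must not be None')
    return active


output_fields = ['wayfare_number', 'wayfare_name', 'wayfare_qualifier',
                 'wayfare_type', 'unit_number', 'unit_type',
                 'property_name', 'institution_name', 'institution_type',
                 'postaddr_number', 'postaddr_type', 'loc_name',
                 'loc_qualifier', 'postcode', None, None, None]

print(len(check_output_fields(output_fields)))
# prints: 14
```

For the example call above, fourteen of the seventeen fields are written, and the territory, country and HMM probability values are discarded.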