Address cleaning and standardisation is currently implemented using an HMM approach only, as described in more detail in Chapter 7. Addresses are standardised into the seventeen address output fields shown in the second column of Table 6.3. The address component standardiser is implemented in the modules standardisation.py and address.py.
An HMM-based address standardiser can be initialised as shown in the code example below. The following arguments need to be given.
name
description
input_fields
The values of the input fields are concatenated (using the field_separator, as described below) into one string before parsing and standardisation are done.
output_fields
Output fields can be set to None if no output is to be written (for example, if one is not interested in the country values), as long as at least one output field is defined (not set to None). Note that the last field holds the probability returned by the address HMM.
address_hmm
address_corr_list
address_tag_table
field_separator
The default value is a whitespace ' '.
check_word_spill
Can be set to True or False in order to activate or de-activate the word spilling functionality (see Section 6.1.4 for more details). If set to True, word spilling between input fields is checked using the address tag look-up table. Note that word spilling is only useful if the value of the field_separator is not an empty string. The default value for word spilling is True.
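The interplay between field_separator and check_word_spill can be sketched in plain Python. This is not the code from standardisation.py: the helper join_input_fields, the known_words set (standing in for the address tag look-up table) and the sample values are all hypothetical, and the spill rule shown (join two fields without a separator when the last word of one and the first word of the next form a known word) is a simplifying assumption about the behaviour described in Section 6.1.4.

```python
def join_input_fields(values, field_separator=' ', known_words=None):
    """Concatenate field values into one string.

    Hypothetical helper, not part of the standardiser modules. If
    known_words is given, a simplified word spilling check is applied:
    when the last word of the string built so far and the first word of
    the next field together form a known word, the separator is omitted.
    """
    joined = ''
    for i, value in enumerate(values):
        if i == 0:
            joined = value
            continue
        left = joined.split()
        right = value.split()
        spilled = bool(known_words is not None and left and right
                       and (left[-1] + right[0]) in known_words)
        if spilled:
            joined += value                    # Re-join the spilled word
        else:
            joined += field_separator + value  # Normal field boundary
    return joined


# Invented field values: 'street' has spilled over a field boundary
values = ['42', 'miller stre', 'et dickson']

plain = join_input_fields(values)                # '42 miller stre et dickson'
glued = join_input_fields(values, '')            # '42miller street dickson'
checked = join_input_fields(values, ' ',
                            {'miller', 'street', 'dickson'})
                                                 # '42 miller street dickson'
```

With an empty separator the spilled word is re-joined by the concatenation itself (at the cost of gluing unrelated fields together), which is consistent with the note that word spilling is only useful for a non-empty field_separator.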
The following example code shows how an HMM-based standardiser for
addresses can be initialised. Note that the last three output fields,
for territory, country and the HMM probability, are set to None, so no
territory or country values will be stored in the output data set, and
the HMM probability is not stored either. Word spilling is activated
and the field separator is a whitespace (neither argument is given, so
their default values are used).
# ====================================================================

address_hmm_std = AddressHMMStandardiser(
    name = 'Addr-Standard-HMM',
    input_fields = ['wfarenum', 'wayfare', 'locality', 'pcode', 'state'],
    output_fields = ['wayfare_number', 'wayfare_name',
                     'wayfare_qualifier', 'wayfare_type',
                     'unit_number', 'unit_type',
                     'property_name', 'institution_name',
                     'institution_type', 'postaddr_number',
                     'postaddr_type', 'loc_name',
                     'loc_qualifier', 'postcode',
                     None, None, None],
    address_corr_list = address_corr_list,
    address_tag_table = address_lookup_table,
    address_hmm = myaddress_hmm)
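The effect of setting entries of output_fields to None can be illustrated with a small stand-alone sketch. The helper select_output and the component values (including the probability) are hypothetical and invented for illustration; the sketch assumes that the standardiser produces its seventeen results in the same order as the output_fields list.

```python
def select_output(output_fields, components):
    """Map standardised components to output fields, skipping None.

    Hypothetical helper, not part of the standardiser modules.
    """
    if len(output_fields) != len(components):
        raise ValueError('Need one component per output field')
    if all(field is None for field in output_fields):
        raise ValueError('At least one output field must be defined')
    return {field: value
            for field, value in zip(output_fields, components)
            if field is not None}


output_fields = ['wayfare_number', 'wayfare_name', 'wayfare_qualifier',
                 'wayfare_type', 'unit_number', 'unit_type',
                 'property_name', 'institution_name', 'institution_type',
                 'postaddr_number', 'postaddr_type', 'loc_name',
                 'loc_qualifier', 'postcode', None, None, None]

# Invented values: the last three are territory, country, HMM probability
components = ['42', 'miller', '', 'street', '', '', '', '', '', '', '',
              'dickson', '', '2602', 'act', 'australia', 0.0001]

out_record = select_output(output_fields, components)
```

Here out_record holds the fourteen named fields only; the territory, country and probability values are dropped because their output fields are None.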