Similar to configuring and running a standardisation, deduplication or linkage process based on one of the Febrl modules project-standardise.py, project-deduplicate.py, or project-linkage.py, a geocoding process needs to be configured and can then be run using a module based on project-geocode.py. Before a geocoding process can be started, the geocode reference data set (assumed to be the Australian G-NAF) must have been processed (and inverted indexes must have been created) using the program process-gnaf.py, as described in Section 10.3 above.
The following code sections and the explanatory texts in-between describe in detail how to configure a geocoding process. Assuming this configuration has been done and a new module called my-project-geocode.py has been saved, the geocoding process can then be started from the command line with:
python my-project-geocode.py
At the top of the project-geocode.py module is the header with version and licensing information, followed by a documentation string that gives the name of the module and a short description (not shown here). Next, all required Febrl and standard Python modules are imported (not shown either) so that the necessary functionality is available.
In the first code block (shown below) a project logger is initialised (see Chapter 15 for more details), which defines how much information is printed to the terminal window and how much is saved into a log file. Next, the Febrl system is initialised by creating a Febrl object myfebrl, and a new project is initialised as part of it.
# ====================================================================
# Define a project logger
#
init_febrl_logger(log_file_name = 'febrl-example-geocode.log',
                  file_level = 'INFO',
                  console_level = 'INFO',
                  clear_log = True,
                  parallel_output = 'host')

# ====================================================================
# Set up Febrl and create a new project
#
myfebrl = Febrl(description = 'Example geocoding Febrl instance',
                febrl_path = '.')

myproject = myfebrl.new_project(name = 'example-geocode',
                                description = 'Geocode example data set 1',
                                file_name = 'example-geocode.fbr',
                                block_size = 100,
                                parallel_write = 'host')
In the next code block the input data set is defined, i.e. the data set to be geocoded. See Chapter 13 for more detailed information about how to define a data set. Please note that the access mode has to be set to 'read'. In the given example it is assumed that each record consists of one single text field only (called 'address') containing the raw unstandardised input address.
# ====================================================================
# Define original input data set
#
indata = DataSetCSV(name = 'example1In',
                    description = 'Example input data set',
                    access_mode = 'read',
                    header_lines = 0,
                    file_name = 'geocode'+dirsep+'testaddresses-small.txt',
                    fields = {'address':0},
                    fields_default = '',
                    strip_fields = True,
                    missing_values = ['','missing'])
The following code section contains the definition of the output data set, i.e. the cleaned, standardised and geocoded data set. Again, see Chapter 13 for more information about how to define a data set. Note that the access mode has to be 'write', and that the defined 'fields' must be a dictionary which includes the standardised output fields as described in Section 6.5 (Address cleaning and standardisation), plus up to eight fields returned from the geocoding process ('longitude', 'latitude', 'match_status', 'match_weight', 'gnaf_pid', 'collection_district', 'neighbour_level', and 'max_avrg_distance'). The longitude and latitude contain the geocoded location of the record, while the 'match_status' gives a short description of how the match has been achieved. The neighbour level is set to the level at which the match was achieved, and the maximum average distance is a value in metres when an average match is returned, and zero otherwise.
Possible match status values are:
exact address match
average address match
many address match
exact street match
average street match
many street match
exact locality match
average locality match
many locality match
no match
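As a simple illustration of how these status values can be used, the following sketch (not Febrl code; it assumes the geocoding run configured below has written the output file 'testaddresses-geocoded.csv' with a header line) counts how many records fall into each match status category.

import csv
from collections import Counter

# Count the match status values in the geocoded output file
# (illustrative post-processing only, not part of Febrl)
#
status_counts = Counter()

for row in csv.DictReader(open('testaddresses-geocoded.csv')):
  status_counts[row['match_status']] += 1

for status, count in status_counts.most_common():
  print('%-25s %6d' % (status, count))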
A numerical match weight is calculated during the geocode matching and returned in the 'match_weight' field (the larger the weight, the better the match). The corresponding G-NAF persistent identifier(s) (PIDs) will be returned in the 'gnaf_pid' field, and if the match corresponds to a unique collection district it will be returned in the 'collection_district' field. It is also possible to add the original input record to the output records (assuming a PassFieldStandardiser has been defined - see below).
# ====================================================================
# Define the output data set
#
outdata = DataSetCSV(name = 'example1geocoded',
                     description = 'Geocoded example data set',
                     access_mode = 'write',
                     write_header = True,
                     file_name = 'testaddresses-geocoded.csv',
                     write_quote_char = '',
                     missing_values = ['','missing'],
                     fields = {'wayfare_number':0,
                               'wayfare_name':1,
                               'wayfare_qualifier':2,
                               'wayfare_type':3,
                               'unit_number':4,
                               'unit_type':5,
                               'property_name':6,
                               'institution_name':7,
                               'institution_type':8,
                               'postaddress_number':9,
                               'postaddress_type':10,
                               'locality_name':11,
                               'locality_qualifier':12,
                               'postcode':13,
                               'territory':14,

        # The next five output fields contain the geocoding latitude and
        # longitude, match status, match weight and the G-NAF identifier(s)
        #
                               'latitude':15,
                               'longitude':16,
                               'match_status':17,
                               'match_weight':18,
                               'gnaf_pid':19,

        # The NSW collection district will be written to the output record
        #
                               'collection_district':20,
                               'neighbour_level':21,
                               'max_avrg_distance':22,

        # Finally we pass the original input address into the output
        #
                               'input_record':23})
Next, look-up tables for neighbouring regions (postcodes and localities) are defined and loaded from files. For each postcode or locality (suburb or town), these look-up tables contain a list of its neighbours. Neighbouring look-up tables are defined for level 1 (i.e. direct neighbours only) as well as level 2 (i.e. including all the neighbours of the direct neighbours); a small illustration of the two levels follows the code block below.
# ====================================================================
# Define and load the neighbouring regions look-up tables
#
pc_level_1_table = NeighbourLookupTable(name = 'PC-1', default = [])
pc_level_1_table.load(file_names = 'pc-neighbours-1.txt')

pc_level_2_table = NeighbourLookupTable(name = 'PC-2', default = [])
pc_level_2_table.load(file_names = 'pc-neighbours-2.txt')

sub_level_1_table = NeighbourLookupTable(name = 'Sub-1', default = [])
sub_level_1_table.load(file_names = 'suburb-neighbours-1.txt')

sub_level_2_table = NeighbourLookupTable(name = 'Sub-2', default = [])
sub_level_2_table.load(file_names = 'suburb-neighbours-2.txt')
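To clarify the difference between the two neighbouring levels, the following small sketch (illustrative only, with made-up postcode values; it shows neither Febrl code nor the format of the neighbour files) derives a level 2 neighbourhood from level 1 neighbour lists.

# Illustration only: level 2 neighbours are the direct (level 1)
# neighbours plus the neighbours of those neighbours
#
level_1 = {'2000': ['2007', '2010'],
           '2007': ['2000', '2008'],
           '2010': ['2000', '2011']}

def level_2_neighbours(region, level_1):
  result = set(level_1.get(region, []))
  for neighbour in level_1.get(region, []):
    result.update(level_1.get(neighbour, []))
  result.discard(region)  # A region is not its own neighbour
  return sorted(result)

print(level_2_neighbours('2000', level_1))
# Prints: ['2007', '2008', '2010', '2011']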
The following code block defines the approximate q-gram indices (which are implemented in the module qgramindex.py). Such indices can be defined for any field, but they are especially useful for fields containing strings (like street and locality names). Also keep in mind that approximate indices come at a cost: searching for approximate matches is compute intensive, especially if a field contains a large number of different values (there are around 5,000 locality names in the New South Wales part of G-NAF and around 36,000 street names). A small illustration of q-grams follows the code block below. When initialising an approximate index, the following attributes need to be set.
- name and description: Each approximate index needs a name and can be given a longer description.
- field_name: The name of the field for which the approximate index will be built. This name must correspond to a key in the geocoder's index_files dictionary (see below).
- q_gram_len_list: A list of tuples, each made of a q-gram length and a range of string lengths for which this q-gram length is used. For example, [(1,(1,4)),(2,(5,99))] uses 1-grams for values of 1 to 4 characters and 2-grams for longer values.
- max_edit_dist: The maximum edit distance allowed when searching for approximate matches.
- load_pickle_file: Can be set to True or False. As creating an approximate index is quite time consuming, it is possible to save the index into a binary Python pickle file once it has been created (this is done if a pickle_file_name is given). If this flag is set to True and the file with the given name does exist, it will be loaded; otherwise the index will be created.
- pickle_file_name: The name of the pickle file for the approximate index; if load_pickle_file is True then the index will be loaded from this file.

All approximate q-gram indices are then put into a dictionary with field names as keys (the keys must correspond to the field_name entries in the corresponding approximate indices). This dictionary is then given to the geocoder as shown below.
# ====================================================================
# Define the approximate q-gram index for the geocoder (the field
# names must correspond to a key in the 'index_files' dictionary of
# the geocoder)
#
loc_name_qgram_index = PosQGramIndex(name = 'loc name q-gram index',
                       description = 'Q-Gram index for locality name',
                       field_name = 'locality_name',
                       q_gram_len_list = [(1,(1,4)),(2,(5,99))],
                       max_edit_dist = 2,
                       load_pickle_file = True,
                       pickle_file_name = 'geocode'+dirsep+ \
                                          'loc_name_qgram_index.pik')

street_name_qgram_index = PosQGramIndex(name = 'street name index',
                          description = 'Q-Gram index for street name',
                          field_name = 'street_name',
                          q_gram_len_list = [(1,(1,4)),(2,(5,99))],
                          max_edit_dist = 2,
                          load_pickle_file = True,
                          pickle_file_name = 'geocode'+dirsep+ \
                                             'street_name_qgram_index.pik')

# Put all approximate q-gram indices into a dictionary with field
# names as keys (the keys must correspond to the 'field_name' entries
# in the corresponding approximate index)
#
approx_indices = {'locality_name':loc_name_qgram_index,
                  'street_name':street_name_qgram_index}
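For readers unfamiliar with q-grams, the following short sketch illustrates the positional q-grams such an index is built on: the sub-strings of length q of a value together with their start positions (pos_qgrams is a hypothetical helper for illustration, not a Febrl function).

# Illustration only: extract positional q-grams (sub-strings of
# length q together with their start positions) from a string
#
def pos_qgrams(value, q):
  return [(value[i:i+q], i) for i in range(len(value)-q+1)]

print(pos_qgrams('miller', 2))
# Prints: [('mi', 0), ('il', 1), ('ll', 2), ('le', 3), ('er', 4)]

Two values can then be compared approximately through their q-grams, with max_edit_dist limiting how different an approximate match is allowed to be.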
In the next code block the main geocoder is initialised. This involves a number of different settings:

- geocode_file_directory: The directory containing the inverted index and geocode files previously created with process-gnaf.py.
- address_site_geocode_file, street_locality_geocode_file and locality_geocode_file: For these geocode files the type used and loaded depends upon the given file extension (either pickle or shelve).
- input_data_set: Has to be set to the output data set defined above (i.e. the output data set of the standardisation).
- input_fields: This dictionary is a mapping from G-NAF fields (or attributes) to standardised Febrl fields. It is possible to set some of these to None. For each of the fields a numerical matching weight needs to be given, which is added to the total matching weight of a record during the geocoding process. The larger the final matching weight is, the better the matching quality (i.e. the more fields were successfully used in the matching). A small worked example follows the code block below.
- match_threshold: Can be set to a value between 0.0 and 1.0, and it controls the way approximate and neighbouring region matches are handled. A higher match threshold means that approximate matches with lower match values are filtered out.
- best_match_only: If set to True only the best match is returned, otherwise (set to False) a list of matches (sorted according to their match weights) is returned.
- missing_value_weight: The weight that is added to the final matching weight of a record for input fields with missing values.
- maximal_neighbour_level: Possible values are 0 (don't search neighbouring regions), 1 (search direct neighbours only), or 2 (search direct and indirect neighbours).
- postcode_neighbours_1, postcode_neighbours_2, suburb_neighbours_1 and suburb_neighbours_2: The neighbouring region look-up tables defined and loaded above.
- max_average_address: Sets the maximum distance (in metres) for which an average location is calculated when several matches are found. If the distance between matches is larger than this maximal value then a 'many match' is returned (with no coordinates).
- index_files: Contains references to all the inverted binary indexes created with process-gnaf.py and stored in the geocode_file_directory. For each file the type of inverted index has to be given (either 'pickle' or 'shelve'), and according to this type the corresponding inverted index will either be fully loaded into memory (pickle) or be disk based (shelve).
# ====================================================================
# The main Geocoder object
#
example_geocoder = Geocoder(name = 'example1geocode',
                   description = 'Example geocoder',
                   geocode_file_directory = 'gnaf'+dirsep+ \
                                            'shelve_pickles'+dirsep,
                   pickle_files_extension = '.pik',
                   shelve_files_extension = '.slv',
                   address_site_geocode_file = 'address_site_geocode.slv',
                   street_locality_geocode_file = \
                                            'street_locality_geocode.pik',
                   locality_geocode_file = 'locality_geocode.pik',
                   collection_district_file = 'collection_district.slv',
                   input_data_set = outdata,
                   input_fields = {'building_name':('property_name', 1.0),
                          'location_description':('institution_name', 1.0),
                          'locality_name':('locality_name', 6.0),
                          'locality_qualifier':('locality_qualifier', 1.0),
                          'postcode':('postcode', 5.0),
                          'state':('territory', 1.0),
                          'wayfare_name':('wayfare_name', 5.0),
                          'wayfare_qualifier':('wayfare_qualifier', 1.0),
                          'wayfare_type':('wayfare_type', 3.0),
                          'wayfare_number':('wayfare_number', 3.0),
                          'flat_number':('unit_number', 2.0),
                          'flat_qualifier':None,
                          'flat_type':('unit_type', 2.0),
                          'level_number':None,
                          'level_type':None,
                          'lot_number':('postaddress_number', 2.0),
                          'lot_number_qualifier':('postaddress_type', 1.0)},
                   match_threshold = 0.8,
                   best_match_only = True,
                   missing_value_weight = 0.0,
                   maximal_neighbour_level = 2,
                   max_average_address = 50,
                   postcode_neighbours_1 = pc_level_1_table,
                   postcode_neighbours_2 = pc_level_2_table,
                   suburb_neighbours_1 = sub_level_1_table,
                   suburb_neighbours_2 = sub_level_2_table,
                   index_files = {'building_name':('building_name', 'pickle'),
                          'flat_number':('flat_number', 'pickle'),
                          'flat_number_prefix':('flat_number_prefix', 'pickle'),
                          'flat_number_suffix':('flat_number_suffix', 'pickle'),
                          'flat_type':('flat_type', 'pickle'),
                          'level_number':('level_number', 'pickle'),
                          'level_type':('level_type', 'pickle'),
                          'locality_name':('locality_name', 'pickle'),
                          'location_descr':('location_descr', 'pickle'),
                          'lot_number':('lot_number', 'shelve'),
                          'lot_number_prefix':('lot_number_prefix', 'pickle'),
                          'lot_number_suffix':('lot_number_suffix', 'pickle'),
                          'number_first':('number_first', 'shelve'),
                          'number_first_prefix':('number_first_prefix', 'pickle'),
                          'number_first_suffix':('number_first_suffix', 'pickle'),
                          'number_last':('number_last', 'pickle'),
                          'number_last_prefix':('number_last_prefix', 'pickle'),
                          'number_last_suffix':('number_last_suffix', 'pickle'),
                          'postcode':('postcode', 'pickle'),
                          'state_abbrev':('state_abbrev', 'pickle'),
                          'street_name':('street_name', 'shelve'),
                          'street_suffix':('street_suffix', 'pickle'),
                          'street_type':('street_type', 'pickle')})
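To make the role of the matching weights concrete, here is a small worked example (the weights are taken from the input_fields definition above, but the matching scenario itself is hypothetical): the total match weight of a candidate G-NAF record is the sum of the weights of the input fields that were successfully matched.

# Hypothetical example of how field weights add up to a match weight
# (weights as in the 'input_fields' definition above)
#
field_weights = {'locality_name':6.0, 'postcode':5.0,
                 'wayfare_name':5.0, 'wayfare_number':3.0}

matched_fields = ['locality_name', 'postcode', 'wayfare_name']

match_weight = sum(field_weights[f] for f in matched_fields)
print(match_weight)
# Prints: 16.0

A second candidate additionally matching on 'wayfare_number' would score 19.0, so with best_match_only = True that better candidate would be the one returned.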
Next the look-up tables for the address cleaning and standardisation process have to be initialised and loaded. This is similar to what needs to be done for standardisation, deduplication or linkage processes. More details about look-up tables can be found in Chapter 14.
# ====================================================================
# Define and load lookup tables
#
address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                      default = '')
address_lookup_table.load(['data'+dirsep+'country.tbl',
                           'data'+dirsep+'address_misc.tbl',
                           'data'+dirsep+'address_qual.tbl',
                           'data'+dirsep+'institution_type.tbl',
                           'data'+dirsep+'locality_name_act.tbl',
                           'data'+dirsep+'locality_name_nsw.tbl',
                           'data'+dirsep+'post_address.tbl',
                           'data'+dirsep+'postcode_act.tbl',
                           'data'+dirsep+'postcode_nsw.tbl',
                           'data'+dirsep+'saints.tbl',
                           'data'+dirsep+'territory.tbl',
                           'data'+dirsep+'unit_type.tbl',
                           'data'+dirsep+'wayfare_type.tbl'])

address_correction_list = CorrectionList(name = 'Address corr. list')
address_correction_list.load('data'+dirsep+'address_corr.lst')
A hidden Markov model (HMM) for address standardisation is initialised and loaded next, and then the standardiser for addresses is defined. For more details about HMM address standardisation see Section 6.5.
# ====================================================================
# Define and load the address hidden Markov model (HMM)
#
address_states = ['wfnu','wfna1','wfna2','wfql','wfty','unnu','unty',
                  'prna1','prna2','inna1','inna2','inty','panu',
                  'paty','hyph','sla','coma','opbr','clbr','loc1',
                  'loc2','locql','pc','ter1','ter2','cntr1','cntr2',
                  'rubb']
address_tags = ['PC','N4','NU','AN','TR','CR','LN','ST','IN','IT',
                'LQ','WT','WN','UT','HY','SL','CO','VB','PA','UN',
                'RU']

myaddress_hmm = hmm('Address HMM', address_states, address_tags)
myaddress_hmm.load_hmm('hmm'+dirsep+'geocode-nsw-address.hmm')
# ====================================================================
# Define a standardiser for addresses based on the HMM
#
address_hmm_std = AddressHMMStandardiser(name = 'Address-HMM',
                        input_fields = ['address'],
                        output_fields = ['wayfare_number',
                                         'wayfare_name',
                                         'wayfare_qualifier',
                                         'wayfare_type',
                                         'unit_number',
                                         'unit_type',
                                         'property_name',
                                         'institution_name',
                                         'institution_type',
                                         'postaddress_number',
                                         'postaddress_type',
                                         'locality_name',
                                         'locality_qualifier',
                                         'postcode',
                                         'territory',
                                         None,
                                         None],
                        address_corr_list = address_correction_list,
                        address_tag_table = address_lookup_table,
                        address_hmm = myaddress_hmm)
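As an illustration of what the HMM standardiser produces, a hypothetical raw input address would be segmented into the standardised output fields roughly as follows (the exact result depends on the loaded HMM and look-up tables; the values shown are made up).

# Hypothetical input record (one raw 'address' field):
#   '42 miller street north sydney nsw 2060'
#
# A possible standardised result (illustrative only):
#
standardised = {'wayfare_number': '42',
                'wayfare_name':   'miller',
                'wayfare_type':   'street',
                'locality_name':  'north sydney',
                'territory':      'nsw',
                'postcode':       '2060'}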
In order to have the original input records copied into the geocoded output records, a PassFieldStandardiser needs to be defined (see Section 6.8 for more details). Of course, if your input data set has different input fields this pass field standardiser has to be modified accordingly.
# ====================================================================
# Define a pass field standardiser that copies the input address
#
pass_fields = PassFieldStandardiser(name = 'Pass fields',
                                    input_fields = ['address'],
                                    output_fields = ['input_record'])
The next code block shows how a record standardiser is set up (more details about this can be found in Section 6.9).
# ====================================================================
# Define record standardiser(s) for the input data set
#
comp_stand = [address_hmm_std, pass_fields]

example_standardiser = RecordStandardiser(name = 'Example-std',
                                          description = 'Ex. standardiser',
                                          input_dataset = indata,
                                          output_dataset = outdata,
                                          comp_std = comp_stand)
Starting the geocoding process is now simply a matter of calling the geocode method of the main project object. Input parameters for this method are the input and output data sets, the record standardiser, the geocoder, as well as the first record number and the number of records from the input data set which are to be loaded, standardised and geocoded.
# ====================================================================
# Start the geocoding task
# - If 'first_record' is set to 'None' then it will be set to 0
# - If 'number_records' is set to 'None' then it will be set to the
#   total number of records in the input data set
#
myproject.geocode(input_dataset = indata,
                  output_dataset = outdata,
                  rec_standardiser = example_standardiser,
                  rec_geocoder = example_geocoder,
                  first_record = 0,
                  number_records = 10000)
The last step is to properly shut down the Febrl system by calling the finalise() method on the Febrl object.
# ====================================================================

myfebrl.finalise()

# ====================================================================