10.4 Geocoding Project Module 'project-geocode.py'

Similar to configuring and running a standardisation, deduplication or linkage process based on one of the Febrl modules project-standardise.py, project-deduplicate.py, or project-linkage.py, a geocoding process needs to be configured and can then be run using a module which is based on the module project-geocode.py. Before a geocoding process can be started the geocode reference data set (assumed to be the Australian G-NAF) must have been processed (and inverted indexes must have been created) using the program process-gnaf.py as described in Section 10.3 above.

The following code sections, and the explanatory text between them, describe in detail how to configure a geocoding process. Assuming this configuration has been done and a new module called my-project-geocode.py has been saved, a geocoding process can then be started from the command line by running that module with the Python interpreter.

At the top of the project-geocode.py module is the header with version and licensing information, followed by a documentation string that gives the name of the module and a short description (not shown here). Next, all required Febrl and standard Python modules are imported (not shown either) so that the necessary functionality is available.

In the first code block shown below, a project logger is initialised (see Chapter 15 for more details). It defines how much information is printed to the terminal window and how much is saved into a log file. Next the Febrl system is initialised by creating a Febrl object myfebrl, and a new project is initialised as part of it.

# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name = 'febrl-example-geocode.log',
                     file_level = 'INFO',
                  console_level = 'INFO',
                      clear_log = True,
                parallel_output = 'host')

# ====================================================================
# Set up Febrl and create a new project

myfebrl = Febrl(description = 'Example geocoding Febrl instance',
                 febrl_path = '.')

myproject = myfebrl.new_project(name = 'example-geocode',
                         description = 'Geocode example data set 1',
                           file_name = 'example-geocode.fbr',
                          block_size = 100,
                      parallel_write = 'host')

In the next code block the input data set is defined, i.e. the data set to be geocoded. See Chapter 13 for more detailed information about how to define a data set. Please note that the access mode has to be set to 'read'. In the given example it is assumed that each record is made of one single text field only (called 'address') containing the raw unstandardised input address.

# ====================================================================
# Define original input data set
# 

indata = DataSetCSV(name = 'example1In',
             description = 'Example input data set',
             access_mode = 'read',
            header_lines = 0,
               file_name = 'geocode'+dirsep+'testaddresses-small.txt',
                  fields = {'address':0},
          fields_default = '',
            strip_fields = True,
          missing_values = ['','missing'])

The following code section contains the definition of the output data set, i.e. the cleaned, standardised and geocoded data set. Again, see Chapter 13 for more information about how to define a data set. Note that the access mode has to be 'write', and that the defined 'fields' must be a dictionary which includes the standardised output fields as described in Section 6.5 (Address cleaning and standardisation), plus up to eight fields returned from the geocoding process ('longitude', 'latitude', 'match_status', 'match_weight', 'gnaf_pid', 'collection_district', 'neighbour_level', and 'max_avrg_distance'). The 'longitude' and 'latitude' fields contain the geocoded location of the record, and 'match_status' gives a short description of how the match was achieved. The neighbour level is set to the level at which the match was achieved, and the maximum average distance is a value in metres when an average match is returned, and zero otherwise. Possible match status values are:
exact address match
average address match
many address match
exact street match
average street match
many street match
exact locality match
average locality match
many locality match
no match

A numerical match weight is calculated during the geocode matching and returned in the 'match_weight' field (the larger the weight, the better the match). The corresponding G-NAF persistent identifier(s) (PIDs) are returned in the 'gnaf_pid' field, and if the match corresponds to a unique collection district it is returned in the 'collection_district' field. It is also possible to add the original input record to the output records (assuming a PassFieldStandardiser has been defined - see below).

# ====================================================================
# Define the output data set

outdata = DataSetCSV(name = 'example1geocoded',
              description = 'Geocoded example data set',
              access_mode = 'write',
             write_header = True,
                file_name = 'testaddresses-geocoded.csv',
         write_quote_char = '',
           missing_values = ['','missing'],
                   fields = {'wayfare_number':0,
                             'wayfare_name':1,
                             'wayfare_qualifier':2,
                             'wayfare_type':3,
                             'unit_number':4,
                             'unit_type':5,
                             'property_name':6,
                             'institution_name':7,
                             'institution_type':8,
                             'postaddress_number':9,
                             'postaddress_type':10,
                             'locality_name':11,
                             'locality_qualifier':12,
                             'postcode':13,
                             'territory':14,
# The next five output fields contain the geocoding latitude and
# longitude, match status, match weight and the G-NAF identifier(s)
                             'latitude':15,
                             'longitude':16,
                             'match_status':17,
                             'match_weight':18,
                             'gnaf_pid':19,

# The NSW collection district will be written to the output record
                             'collection_district':20,
                             'neighbour_level':21,
                             'max_avrg_distance':22,
# Finally we pass the original input address into the output 
                             'input_record':23})
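
Once a geocoding run has written its output file, the match status column gives a quick summary of how well the run went. The following is a minimal sketch in plain Python, not part of Febrl: the function name tally_match_status is hypothetical, the column position 17 is taken from the 'fields' dictionary above, and the sample rows are made up purely for illustration.

```python
import csv
import io
from collections import Counter

# Column position of 'match_status' in the output data set defined above
MATCH_STATUS_COLUMN = 17


def tally_match_status(csv_file, has_header=True):
    """Count how often each match status occurs in a geocoded CSV file."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header line (write_header = True)
    counts = Counter()
    for row in reader:
        if len(row) > MATCH_STATUS_COLUMN:
            counts[row[MATCH_STATUS_COLUMN].strip()] += 1
    return counts


def make_row(status):
    """Build a made-up 24 column output row with only the status filled in."""
    row = [''] * 24
    row[MATCH_STATUS_COLUMN] = status
    return ','.join(row)


# Made-up sample data standing in for a real output file
sample = '\n'.join(['header' + ',' * 23,
                    make_row('exact address match'),
                    make_row('exact address match'),
                    make_row('no match')])

counts = tally_match_status(io.StringIO(sample))
for status, number in sorted(counts.items()):
    print(status, number)
```

For a real run the io.StringIO object would be replaced by an open file handle on the geocoded output file.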

Next, look-up tables for neighbouring regions (postcodes and localities) are defined and loaded from files. For each postcode or locality (suburb, town), these look-up tables contain a list of its neighbours. Neighbouring look-up tables are defined for level 1 (i.e. direct neighbours only) as well as level 2 (i.e. including all the neighbours of the direct neighbours).

# ====================================================================
# Define and load the neighbouring regions look-up tables
#
pc_level_1_table = NeighbourLookupTable(name = 'PC-1', default = [])
pc_level_1_table.load(file_names = 'pc-neighbours-1.txt')

pc_level_2_table = NeighbourLookupTable(name = 'PC-2', default = [])
pc_level_2_table.load(file_names = 'pc-neighbours-2.txt')

sub_level_1_table = NeighbourLookupTable(name = 'Sub-1', default = [])
sub_level_1_table.load(file_names = 'suburb-neighbours-1.txt')

sub_level_2_table = NeighbourLookupTable(name = 'Sub-2', default = [])
sub_level_2_table.load(file_names = 'suburb-neighbours-2.txt')
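
The relationship between the two neighbour levels can be illustrated with a short sketch. This is not how Febrl builds its look-up files; it only shows, under stated assumptions, how a level 2 table follows from a level 1 table. The function name level_2_neighbours and the region labels are hypothetical, and it is an assumption here that a region is not listed as its own neighbour.

```python
def level_2_neighbours(level_1):
    """Derive level 2 neighbours (direct neighbours plus their neighbours)
    from a dictionary mapping each region to its direct neighbours."""
    level_2 = {}
    for region, direct in level_1.items():
        reachable = set(direct)
        for neighbour in direct:
            # Add the neighbours of each direct neighbour
            reachable.update(level_1.get(neighbour, []))
        reachable.discard(region)  # assumption: a region is not its own neighbour
        level_2[region] = sorted(reachable)
    return level_2


# Made-up adjacency between four hypothetical regions A - B - C - D,
# where A also touches C
level_1 = {'A': ['B', 'C'],
           'B': ['A'],
           'C': ['A', 'D'],
           'D': ['C']}

level_2 = level_2_neighbours(level_1)
print(level_2['B'])  # B reaches A directly and C through A
```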

The following code block defines the approximate q-gram index (which is implemented in the module qgramindex.py). Such indices can be defined for any field, but are especially useful for fields containing strings (like street and locality names). Keep in mind that approximate indices come at a cost: searching for approximate matches is compute intensive, especially if a field contains a large number of different values (there are around 5,000 locality names in the New South Wales part of G-NAF and around 36,000 street names). When initialising an approximate index, the attributes shown in the following code need to be set.

# ====================================================================
# Define the approximate q-gram index for the geocoder (the field
# names must correspond to a key in the 'index_files' dictionary of
# the geocoder)
#
loc_name_qgram_index = PosQGramIndex(name = 'loc name q-gram index',
                   description = 'Q-Gram index for locality name',
                    field_name = 'locality_name',
               q_gram_len_list = [(1,(1,4)),(2,(5,99))],
                 max_edit_dist = 2,
              load_pickle_file = True,
              pickle_file_name = 'geocode'+dirsep+ \
                                 'loc_name_qgram_index.pik')

street_name_qgram_index = PosQGramIndex(name = 'street name index',
                description = 'Q-Gram index for street name',
                 field_name = 'street_name',
            q_gram_len_list = [(1,(1,4)),(2,(5,99))],
              max_edit_dist = 2,
           load_pickle_file = True,
           pickle_file_name = 'geocode'+dirsep+ \
                              'street_name_qgram_index.pik')

# Put all approximate q-gram indices into a dictionary with field
# names as keys (the keys must correspond to the 'field_name' entries
# in the corresponding approximate index)
#
approx_indices = {'locality_name':loc_name_qgram_index,
                    'street_name':street_name_qgram_index}
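
To give a feeling for what a positional q-gram index works with, the following sketch splits strings into positional q-grams and computes an edit distance. It is a generic illustration and not the implementation in qgramindex.py. In particular, reading each entry of 'q_gram_len_list' as (q, (minimum length, maximum length)) is an assumption made here, and the function names choose_q, positional_qgrams and edit_distance are hypothetical.

```python
def choose_q(value, q_gram_len_list):
    """Pick q for a string, ASSUMING each list entry means
    (q, (min_length, max_length)). Returns None if no range fits."""
    for q, (min_len, max_len) in q_gram_len_list:
        if min_len <= len(value) <= max_len:
            return q
    return None


def positional_qgrams(value, q):
    """Return a list of (q-gram, start position) pairs for a string."""
    return [(value[i:i + q], i) for i in range(len(value) - q + 1)]


def edit_distance(s, t):
    """Standard Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


q_gram_len_list = [(1, (1, 4)), (2, (5, 99))]  # values from the listing above

q = choose_q('sydney', q_gram_len_list)
print(q, positional_qgrams('sydney', q))
print(edit_distance('sydney', 'sidney') <= 2)  # within max_edit_dist = 2
```

Strings that share many q-grams at similar positions are likely to be within a small edit distance of each other, which is why such an index can narrow down the candidates before the more expensive comparison is made.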

In the next code block the main geocoder is initialised. This involves many different settings.

# ====================================================================
# The main Geocoder object
#
example_geocoder = Geocoder(name = 'example1geocode',
                     description = 'Example geocoder',
          geocode_file_directory = 'gnaf'+dirsep+'shelve_pickles'+ \
                                   dirsep,
          pickle_files_extension = '.pik',
          shelve_files_extension = '.slv',
       address_site_geocode_file = 'address_site_geocode.slv',
    street_locality_geocode_file = 'street_locality_geocode.pik',
           locality_geocode_file = 'locality_geocode.pik',
        collection_district_file = 'collection_district.slv',
                  input_data_set = outdata,

         input_fields = {'building_name':('property_name',      1.0),
                  'location_description':('institution_name',   1.0),
                         'locality_name':('locality_name',      6.0),
                    'locality_qualifier':('locality_qualifier', 1.0),
                              'postcode':('postcode',           5.0),
                                 'state':('territory',          1.0),
                          'wayfare_name':('wayfare_name',       5.0),
                     'wayfare_qualifier':('wayfare_qualifier',  1.0),
                          'wayfare_type':('wayfare_type',       3.0),
                        'wayfare_number':('wayfare_number',     3.0),
                           'flat_number':('unit_number',        2.0),
                        'flat_qualifier':None,
                             'flat_type':('unit_type',          2.0),
                          'level_number':None,
                            'level_type':None,
                            'lot_number':('postaddress_number', 2.0),
                  'lot_number_qualifier':('postaddress_type',   1.0)},
                 match_threshold = 0.8,
                 best_match_only = True,
            missing_value_weight = 0.0,
         maximal_neighbour_level = 2,
             max_average_address = 50,
           postcode_neighbours_1 = pc_level_1_table,
           postcode_neighbours_2 = pc_level_2_table,
             suburb_neighbours_1 = sub_level_1_table,
             suburb_neighbours_2 = sub_level_2_table,
    index_files = {'building_name':('building_name',       'pickle'),
                     'flat_number':('flat_number',         'pickle'),
              'flat_number_prefix':('flat_number_prefix',  'pickle'),
              'flat_number_suffix':('flat_number_suffix',  'pickle'),
                       'flat_type':('flat_type',           'pickle'),
                    'level_number':('level_number',        'pickle'),
                      'level_type':('level_type',          'pickle'),
                   'locality_name':('locality_name',       'pickle'),
                  'location_descr':('location_descr',      'pickle'),
                      'lot_number':('lot_number',          'shelve'),
               'lot_number_prefix':('lot_number_prefix',   'pickle'),
               'lot_number_suffix':('lot_number_suffix',   'pickle'),
                    'number_first':('number_first',        'shelve'),
             'number_first_prefix':('number_first_prefix', 'pickle'),
             'number_first_suffix':('number_first_suffix', 'pickle'),
                     'number_last':('number_last',         'pickle'),
              'number_last_prefix':('number_last_prefix',  'pickle'),
              'number_last_suffix':('number_last_suffix',  'pickle'),
                        'postcode':('postcode',            'pickle'),
                    'state_abbrev':('state_abbrev',        'pickle'),
                     'street_name':('street_name',         'shelve'),
                   'street_suffix':('street_suffix',       'pickle'),
                     'street_type':('street_type',         'pickle')})
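
The comments in the q-gram index listing state that each approximate index must use a field name that is a key in the 'index_files' dictionary of the geocoder, and that the keys of the dictionary of approximate indices must match the 'field_name' entries. A small check of this kind can catch configuration mistakes before a long run is started. The following is a sketch only: check_approx_indices is a hypothetical helper, and plain dictionaries stand in for the Febrl objects.

```python
def check_approx_indices(approx_field_names, index_files):
    """Return a list of problems found in an approximate index set-up.

    approx_field_names maps each key of the approximate indices dictionary
    to the 'field_name' of the index stored under that key (a plain string
    stand-in for the real index object). index_files is the geocoder's
    'index_files' dictionary.
    """
    problems = []
    for key, field_name in approx_field_names.items():
        if key != field_name:
            problems.append('key %r differs from field_name %r'
                            % (key, field_name))
        if field_name not in index_files:
            problems.append('field_name %r has no entry in index_files'
                            % field_name)
    return problems


# Shortened stand-in for the 'index_files' dictionary shown above
index_files = {'locality_name': ('locality_name', 'pickle'),
               'postcode': ('postcode', 'pickle'),
               'street_name': ('street_name', 'shelve')}

good = {'locality_name': 'locality_name', 'street_name': 'street_name'}
bad = {'locality_name': 'locality_name', 'suburb': 'suburb'}

print(check_approx_indices(good, index_files))
print(check_approx_indices(bad, index_files))
```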

Next the look-up tables for the address cleaning and standardisation process have to be initialised and loaded. This is similar to what needs to be done for standardisation, deduplication or linkage processes. More details about look-up tables can be found in Chapter 14.

# ====================================================================
# Define and load lookup tables

address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                   default = '')
address_lookup_table.load(['data'+dirsep+'country.tbl',
                           'data'+dirsep+'address_misc.tbl',
                           'data'+dirsep+'address_qual.tbl',
                           'data'+dirsep+'institution_type.tbl',
                           'data'+dirsep+'locality_name_act.tbl',
                           'data'+dirsep+'locality_name_nsw.tbl',
                           'data'+dirsep+'post_address.tbl',
                           'data'+dirsep+'postcode_act.tbl',
                           'data'+dirsep+'postcode_nsw.tbl',
                           'data'+dirsep+'saints.tbl',
                           'data'+dirsep+'territory.tbl',
                           'data'+dirsep+'unit_type.tbl',
                           'data'+dirsep+'wayfare_type.tbl'])

address_correction_list = CorrectionList(name = 'Address corr. list')
address_correction_list.load('data'+dirsep+'address_corr.lst')

A hidden Markov model (HMM) for address standardisation is initialised and loaded next, and then the standardiser for addresses is defined. For more details about HMM address standardisation see Section 6.5.

# ====================================================================
# Define and load address hidden Markov model (HMM)

address_states = ['wfnu','wfna1','wfna2','wfql','wfty','unnu','unty',
                  'prna1','prna2','inna1','inna2','inty','panu',
                  'paty','hyph','sla','coma','opbr','clbr','loc1',
                  'loc2','locql','pc','ter1','ter2','cntr1','cntr2',
                  'rubb']
address_tags = ['PC','N4','NU','AN','TR','CR','LN','ST','IN','IT',
                'LQ','WT','WN','UT','HY','SL','CO','VB','PA','UN',
                'RU']

myaddress_hmm = hmm('Address HMM', address_states, address_tags)
myaddress_hmm.load_hmm('hmm'+dirsep+'geocode-nsw-address.hmm')

# ====================================================================
# Define a standardiser for addresses based on HMM

address_hmm_std = AddressHMMStandardiser(name = 'Address-HMM',
                                 input_fields = ['address'],
                                output_fields = ['wayfare_number',
                                                 'wayfare_name',
                                                 'wayfare_qualifier',
                                                 'wayfare_type',
                                                 'unit_number',
                                                 'unit_type',
                                                 'property_name',
                                                 'institution_name',
                                                 'institution_type',
                                                 'postaddress_number',
                                                 'postaddress_type',
                                                 'locality_name',
                                                 'locality_qualifier',
                                                 'postcode',
                                                 'territory',
                                                 None,
                                                 None],
                            address_corr_list = address_correction_list,
                            address_tag_table = address_lookup_table,
                                  address_hmm = myaddress_hmm)

In order to be able to have the original input records copied into the geocoded output records a PassFieldStandardiser needs to be defined (see Section 6.8 for more details). Of course if your input data set has different input fields this field pass standardiser has to be modified.

# ====================================================================

pass_fields = PassFieldStandardiser(name = 'Pass fields',
                            input_fields = ['address'],
                           output_fields = ['input_record'])

The next code block shows how a record standardiser is set up (more details about this can be found in Section 6.9).

# ====================================================================
# Define record standardiser(s) for the input data set

comp_stand = [address_hmm_std, pass_fields]

example_standardiser = RecordStandardiser(name = 'Example-std',
                                   description = 'Ex. standardiser',
                                 input_dataset = indata,
                                output_dataset = outdata,
                                      comp_std = comp_stand)

The geocoding process can now be started by calling the geocode method of the main project object. Input parameters for this method are the input and output data sets, the record standardiser, the geocoder, as well as the first record number and the number of records from the input data set which are to be loaded, standardised and geocoded.

# ====================================================================
# Start the geocoding task
# - If 'first_record' is set to 'None' then it will be set to 0
# - If 'number_records' is set to 'None' then it will be set to the
#   total number of records in the input data set.

myproject.geocode(input_dataset = indata,
                 output_dataset = outdata,
               rec_standardiser = example_standardiser,
                   rec_geocoder = example_geocoder,
                   first_record = 0,
                 number_records = 10000)
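
The defaults described in the comments above can be summarised in a few lines of Python. This is a sketch and not Febrl code: the function names are hypothetical, and treating 'block_size' from the project definition as the number of records handled per block is an assumption that this section does not confirm.

```python
def resolve_record_range(first_record, number_records, total_records):
    """Apply the documented defaults: None means 0 for 'first_record' and
    the total number of records for 'number_records'."""
    if first_record is None:
        first_record = 0
    if number_records is None:
        number_records = total_records
    return first_record, number_records


def record_blocks(first_record, number_records, block_size):
    """Yield (start, length) pairs, ASSUMING records are processed in
    consecutive blocks of at most 'block_size' records."""
    end = first_record + number_records
    for start in range(first_record, end, block_size):
        yield start, min(block_size, end - start)


first, number = resolve_record_range(None, None, 250)
blocks = list(record_blocks(first, number, 100))
print(first, number, blocks)
```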

The last step is to properly shut down the Febrl system by using the finalise() method on the Febrl project object.