10.3 Program 'process-gnaf.py'

To use the G-NAF data set efficiently for geocoding, the necessary G-NAF files need to be cleaned, pre-processed and indexed so that matching records (and their longitude and latitude) can be retrieved quickly.

The program process-gnaf.py (available in the geocode directory) performs this pre-processing of the G-NAF files, which are assumed to be available as CSV (comma separated values) text files. The main computation routines used in process-gnaf.py are implemented in the module gnaffunctions.py, which is located in the same directory.

As discussed in Section 10.1, G-NAF consists of many files containing the normalised address, street and locality data, geocoding information (for address sites, streets and localities), as well as various alias information.

All settings for process-gnaf.py need to be specified by the user within the program itself. The main processing flags (or switches) that control what kind of pre-processing is performed are discussed first, followed by all other process-gnaf.py settings.

At the beginning of process-gnaf.py - after the license header, module description and module imports - is a code section which contains a number of flags (or switches) which can be set to True or False in order to enable or disable the processing of parts of the G-NAF files. The following flags can be set.

check_pid_uniqueness
If set to True, all PIDs (persistent identifiers) in the G-NAF files G_ADDRESS_DETAIL, G_ADDRESS_SITE, G_STREET, and G_LOCALITY are checked for uniqueness, and errors are reported.
save_pickle_files
Set this flag to True if the inverted indexes are to be saved into Python pickle files. See the Python documentation (module index) for information about the pickle module.
save_shelve_files
Set this flag to True if the inverted indexes are to be saved into Python shelve files. See the Python documentation (module index) for information about the shelve module.
save_text_files
Set this flag to True if the inverted indexes are to be saved into text files. This is mainly for debugging purposes, so that the inverted indexes can be viewed easily. These text files cannot be used by the geocoding process.
process_coll_dist_files
In some cases special external look-up files with data about collection districts (in Australia) might be available and useful for later processing of the geocoded records. Set this flag to True (and define the corresponding file details as described below) if such collection district files are to be processed (and saved into pickle, shelve and/or text files according to the above given save_..._files flags).
process_locality_files
Set this flag to True if the inverted indexes for locality related attributes (using locality related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the above given save_..._files flags).
process_street_files
Set this flag to True if the inverted indexes for street related attributes (using street related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the above given save_..._files flags).
process_address_files
Set this flag to True if the inverted indexes for address site related attributes (using address related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the above given save_..._files flags).
create_reverse_lookup_shelve
When this flag is set to True a shelve file (a look-up dictionary) will be created which contains G-NAF persistent identifiers (PIDs) as keys and the corresponding records taken from the following five G-NAF files:
G_LOCALITY
G_LOCALITY_ALIAS
G_STREET
G_STREET_LOCALITY_ALIAS
G_ADDRESS_DETAIL
Using the program reverse-gnaf.py (see Section 10.5.2) it is possible to do reverse look-ups, i.e. given one or more G-NAF PIDs one can get the corresponding records containing the original values. This can be helpful for debugging, as well as for clarifying cases with unexpected geocode matching results.
create_gnaf_address_csv_file
When this flag is set to True a comma separated values (CSV) text file will be created (with the file name set using gnaf_address_csv_file_name) which will contain all the addresses in G-NAF including longitude and latitude, compiled using the following G-NAF files:
G_LOCALITY
G_STREET
G_ADDRESS_DETAIL
G_ADDRESS_SITE_GEOCODE
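
The idea behind the reverse look-up shelve described above (PIDs as keys, original records as values) can be illustrated with a minimal sketch. It is written for Python 3 and is not taken from process-gnaf.py or reverse-gnaf.py: the function names build_reverse_lookup and reverse_lookup, the record layout, and the PIDs in sample_records are all hypothetical and only serve as an illustration.

```python
import os
import shelve
import tempfile

# Hypothetical sample records keyed by made-up PIDs (not real G-NAF data).
sample_records = {
    'LOC0001': {'file': 'G_LOCALITY', 'locality_name': 'EXAMPLETOWN'},
    'STR0001': {'file': 'G_STREET', 'street_name': 'EXAMPLE',
                'street_type': 'STREET'},
}


def build_reverse_lookup(shelve_path, records):
    """Store every record in a new shelve file, using its PID as the key."""
    with shelve.open(shelve_path, flag='n') as db:  # 'n': always create anew
        for pid, record in records.items():
            db[pid] = record


def reverse_lookup(shelve_path, pids):
    """Return a dictionary mapping each given PID to its record.

    PIDs that are not in the shelve are mapped to None.
    """
    with shelve.open(shelve_path, flag='r') as db:  # 'r': read-only access
        return {pid: db.get(pid) for pid in pids}


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'gnaf-reverse-lookup')
        build_reverse_lookup(path, sample_records)
        for pid, record in reverse_lookup(path, ['STR0001', 'UNKNOWN']).items():
            print(pid, record)
```

Because a shelve keeps its data on disk and only loads the records that are requested, a look-up of a few PIDs does not require reading the complete set of G-NAF records into memory.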

The following code section shows the above described flags as taken from process-gnaf.py.

```
# ====================================================================
# Some flags that control the G-NAF pre-processing, set to either True
# or False

check_pid_uniqueness = False

save_pickle_files = True    # Save inverted indexes into binary Python
                            # pickles
save_shelve_files = True    # Save inverted indexes into binary Python
                            # shelves
save_text_files   = True    # Save inverted indexes into text files

process_coll_dist_files = True    # Process collection district files

process_locality_files = True     # Process the G-NAF locality related
                                  # files
process_street_files = True       # Process the G-NAF street related
                                  # files
process_address_files = True      # Process the G-NAF address related
                                  # files

create_reverse_lookup_shelve = True  # Create one large shelve to be
                                     # used for reverse look-ups (i.e.
                                     # given one or more PID find the
                                     # corresponding G-NAF records)

create_gnaf_address_csv_file = True  # Create one large CSV file with
                                     # all G-NAF addresses (values
                                     # merged from several files) and
                                     # their locations

gnaf_address_csv_file_name = 'gnav_address_geocodes.csv'
                                             # Corresponding file name
```
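
To make the effect of the three save_..._files flags more concrete, the following minimal sketch builds a tiny inverted index (attribute value to set of PIDs) and writes it out as a pickle, a shelve and a text file. It is written for Python 3 and is an illustration only, not the code used in gnaffunctions.py: the function names, the keyword arguments (which merely mirror the flags), the '.pik' extension and the sample PIDs are assumptions.

```python
import os
import pickle
import shelve
import tempfile


def build_inverted_index(records):
    """Map each attribute value to the set of PIDs in which it occurs.

    'records' is an iterable of (pid, value) pairs.
    """
    index = {}
    for pid, value in records:
        index.setdefault(value, set()).add(pid)
    return index


def save_inverted_index(index, base_path, save_pickle=True, save_shelve=True,
                        save_text=True, pickle_ext='.pik', shelve_ext='.slv'):
    """Save the index in every enabled format; return the paths requested.

    The keyword arguments mirror the save_..._files flags. Note that the
    shelve backend may append its own suffix to the shelve path on disk.
    """
    written = []
    if save_pickle:
        path = base_path + pickle_ext
        with open(path, 'wb') as f:
            pickle.dump(index, f)
        written.append(path)
    if save_shelve:
        path = base_path + shelve_ext
        with shelve.open(path, flag='n') as db:
            for value, pids in index.items():
                db[value] = pids  # shelve keys have to be strings
        written.append(path)
    if save_text:
        path = base_path + '.txt'
        with open(path, 'w') as f:  # human readable, for debugging only
            for value in sorted(index):
                f.write('%s: %s\n' % (value, ', '.join(sorted(index[value]))))
        written.append(path)
    return written


if __name__ == '__main__':
    # Made-up (PID, street name) pairs, not real G-NAF data
    sample = [('P1', 'MAIN'), ('P2', 'MAIN'), ('P3', 'HIGH')]
    with tempfile.TemporaryDirectory() as tmp_dir:
        base = os.path.join(tmp_dir, 'street_name')
        print(save_inverted_index(build_inverted_index(sample), base))
```

A pickle has to be loaded into memory as a whole, while a shelve can be accessed key by key from disk, which is one reason why both formats can be useful.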

Several other important settings follow further down in the process-gnaf.py program. They have to be set by the user before a G-NAF pre-processing run can be started.

A project logger can be defined, followed by a Febrl object (that needs to be defined). Generally there is no need to modify this. See Chapter 15 for more details on how to configure a project logger.
Tagging look-up tables and correction lists for addresses need to be defined and loaded, similar to what is done in a standardisation process (see Chapter 5 for an example). These look-up tables and correction lists are used to clean and standardise the G-NAF files in the same way as the user data set is cleaned and standardised later in the geocode matching process (see Section 10.4). Therefore, these look-up tables should be the same as the ones used in the project-geocode.py module, so that cleaning and tagging is done in the same way.
Next, the location of the original G-NAF data files (the G-NAF input directory gnaf_input_directory) as well as the directory where all created files will be stored (the G-NAF output directory gnaf_output_directory) need to be defined.
The file extensions of the G-NAF data files (assumed to be comma separated values text files), as well as of the binary shelve and pickle files (for the inverted indexes), have to be defined next (for text and comma separated values output files the standard file extensions .txt and .csv are used, respectively). Note that the file extensions defined must include the initial period (e.g. '.slv' and not just 'slv').
Next, all the G-NAF files have to be defined (they will all be added to the dictionary gnaf_files). Each G-NAF file definition consists of two parts: (1) the file name (without a file extension), and (2) a list of the field (or attribute) names as given in the header line of each G-NAF file (it is assumed that the first line in all G-NAF files is the header line, and this first line will thus not be used for processing). For example, the file definition for the G-NAF file 'G_STATE.csv' containing state information will be:
```
gnaf_files['g_state'] = ['G_STATE',
                         ["STATE_ABBREVIATION",
                          "STATE_NAME"]]
```
Finally, the setting austpost_lookup_file_name has to be set to the name of an Australia Post postcode/suburb look-up file (which can be downloaded from http://www.auspost.com.au/postcodes/). A state filter austpost_lookup_state_filter can also be set (set it to None if all Australian postcodes/suburbs should be processed into a look-up table for use while pre-processing the G-NAF locality files). Possible state values are:
'aat' for the Australian Antarctic Territory
'act' for the Australian Capital Territory
'nsw' for New South Wales
'nt' for the Northern Territory
'qld' for Queensland
'sa' for South Australia
'tas' for Tasmania
'vic' for Victoria
'wa' for Western Australia
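
Returning to the gnaf_files definitions described above, the following minimal sketch shows how such a definition (file name without extension plus list of field names), an input directory and a file extension could be combined to read one G-NAF file. It is written for Python 3 and is an assumption about how the settings fit together, not the code in gnaffunctions.py; the function name read_gnaf_file is hypothetical, and the demonstration file is created on the fly.

```python
import csv
import os
import tempfile


def read_gnaf_file(gnaf_files, key, input_directory, file_extension='.csv'):
    """Yield one dictionary (field name to value) per data record.

    The file path is built from the input directory, the file name stored
    in the definition, and the file extension (including its period).
    """
    file_name, field_names = gnaf_files[key]
    path = os.path.join(input_directory, file_name + file_extension)
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # The first line is the header line, skip it
        for row in reader:
            yield dict(zip(field_names, row))


if __name__ == '__main__':
    gnaf_files = {}
    gnaf_files['g_state'] = ['G_STATE',
                             ["STATE_ABBREVIATION",
                              "STATE_NAME"]]

    with tempfile.TemporaryDirectory() as gnaf_input_directory:
        # Write a tiny stand-in for G_STATE.csv so the sketch can be run
        demo_path = os.path.join(gnaf_input_directory, 'G_STATE.csv')
        with open(demo_path, 'w', newline='') as f:
            f.write('STATE_ABBREVIATION,STATE_NAME\n')
            f.write('NSW,NEW SOUTH WALES\n')
            f.write('ACT,AUSTRALIAN CAPITAL TERRITORY\n')

        for record in read_gnaf_file(gnaf_files, 'g_state',
                                     gnaf_input_directory):
            print(record)
```

Keeping the field names in the definition rather than reading them from the header line means the program always refers to attributes by the names given in its settings.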

Once all the settings in process-gnaf.py have been adjusted to the user's needs, the G-NAF pre-processing can be started from the command line with:

python process-gnaf.py

Once all G-NAF files have been processed into binary inverted index files (pickle and/or shelve files), they will be available in the defined G-NAF output directory, and can then be used by the Febrl geocoding system as explained in Section 10.4.

Warning: While processing G-NAF address related files the program process-gnaf.py uses a large amount of main memory, for example around 3.5 Gigabytes (for processing around 4 million records from New South Wales) if all the pre-processing is done in one pass (i.e. if all flags explained above are set to True). If your machine has less main memory, swapping (or thrashing) will occur, increasing the pre-processing times tremendously and slowing down your machine. See Section 10.3.1 for more details.