To use the G-NAF data set efficiently for geocoding, the necessary G-NAF files need to be cleaned, pre-processed and indexed so that matching records (and their longitude and latitude) can be retrieved quickly.
The program process-gnaf.py (available in the
directory) does exactly this pre-processing of the G-NAF files, which
are assumed to be available as CSV (comma separated values) text
files. The main computational routines
used in process-gnaf.py are implemented in the module
gnaffunctions.py which is available in the directory
As discussed in Section 10.1, G-NAF consists of many files containing the normalised address, street and locality data, geocoding information (for address sites, streets and localities), as well as various alias information.
All settings for process-gnaf.py need to be specified by the user within the program itself. The main processing flags (or switches) that control what kind of pre-processing is performed are discussed first, followed by all other process-gnaf.py settings.
At the beginning of process-gnaf.py - after the license
header, module description and module imports - is a code section
which contains a number of flags (or switches) which can be set to True or
False in order to enable or disable the
processing of parts of the G-NAF files. The following flags can be set:
check_pid_uniqueness: If set to True, all PIDs (persistent identifiers) in the G-NAF files (including G_LOCALITY) are checked for their uniqueness, and errors are reported.
save_pickle_files: Set to True if the inverted indexes are to be saved into Python pickle files. See the Python documentation (module index) for information about the pickle module.
save_shelve_files: Set to True if the inverted indexes are to be saved into Python shelve files. See the Python documentation (module index) for information about the shelve module.
save_text_files: Set to True if the inverted indexes are to be saved into text files. This is mainly for debugging purposes, so that the inverted indexes can be viewed easily. These text files cannot be used by the geocoding process.
process_coll_dist_files: Set to True (and define the corresponding file details as described below) if collection district files are to be processed (and saved into pickle, shelve and/or text files according to the save flags given above).
process_locality_files: Set to True if the inverted indexes for locality related attributes (using locality related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the save flags given above).
process_street_files: Set to True if the inverted indexes for street related attributes (using street related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the save flags given above).
process_address_files: Set to True if the inverted indexes for address site related attributes (using address related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the save flags given above).
create_reverse_lookup_shelve: If set to True, a shelve file (a look-up dictionary) will be created which contains G-NAF persistent identifiers (PIDs) as keys and the corresponding records taken from the following five G-NAF files:
create_gnaf_address_csv_file: If set to True, a comma separated values (CSV) text file will be created (with the file name set using
gnaf_address_csv_file_name) which will contain all the addresses in G-NAF including longitude and latitude, compiled using the following G-NAF files:
# ====================================================================
# Some flags that control the G-NAF pre-processing, set to either True
# or False

check_pid_uniqueness = False

save_pickle_files = True  # Save inverted indexes into binary Python
                          # pickles
save_shelve_files = True  # Save inverted indexes into binary Python
                          # shelves
save_text_files = True    # Save inverted indexes into text files

process_coll_dist_files = True  # Process collection district files
process_locality_files = True   # Process the G-NAF locality related
                                # files
process_street_files = True     # Process the G-NAF street related
                                # files
process_address_files = True    # Process the G-NAF address related
                                # files

create_reverse_lookup_shelve = True  # Create one large shelve to be
                                     # used for reverse look-ups (i.e.
                                     # given one or more PID find the
                                     # corresponding G-NAF records)

create_gnaf_address_csv_file = True  # Create one large CSV file with
                                     # all G-NAF addresses (values
                                     # merged from several files) and
                                     # their locations
gnaf_address_csv_file_name = 'gnav_address_geocodes.csv'  # Corresponding file name
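To illustrate how the three save flags might interact, the following minimal sketch builds a tiny inverted index and writes it to pickle, shelve and text files depending on the flag values. The functions build_inverted_index and save_inverted_index, the record layout, the sample PIDs and the '.pik' and '.txt' extensions are assumptions made for this example; they are not taken from process-gnaf.py or gnaffunctions.py.

```python
import os
import pickle
import shelve
import tempfile

# Flags named as in process-gnaf.py; the values are only for this sketch
save_pickle_files = True
save_shelve_files = True
save_text_files = True


def build_inverted_index(records, field):
    """Map each value of 'field' to the set of PIDs of records containing it.

    'records' is a hypothetical dictionary {pid: {field_name: value}}.
    """
    index = {}
    for pid, rec in records.items():
        value = rec.get(field, '')
        if value:  # Skip empty values
            index.setdefault(value, set()).add(pid)
    return index


def save_inverted_index(index, out_dir, base_name):
    """Save 'index' in each format whose flag is set; return the paths used."""
    written = []
    if save_pickle_files:
        path = os.path.join(out_dir, base_name + '.pik')
        with open(path, 'wb') as f:
            pickle.dump(index, f)
        written.append(path)
    if save_shelve_files:
        path = os.path.join(out_dir, base_name + '.slv')
        with shelve.open(path) as db:  # Shelve keys must be strings
            for value, pids in index.items():
                db[value] = pids
        written.append(path)
    if save_text_files:
        path = os.path.join(out_dir, base_name + '.txt')
        with open(path, 'w') as f:
            for value in sorted(index):
                f.write('%s: %s\n' % (value, ','.join(sorted(index[value]))))
        written.append(path)
    return written


# Hypothetical locality records keyed by PID
records = {'L1': {'locality_name': 'dickson'},
           'L2': {'locality_name': 'lyneham'},
           'L3': {'locality_name': 'dickson'}}

out_dir = tempfile.mkdtemp()
index = build_inverted_index(records, 'locality_name')
paths = save_inverted_index(index, out_dir, 'locality_name')
```

Depending on the underlying database library, the shelve module may append its own suffix to the file name it is given, so the shelve should be re-opened with the same path that was passed to shelve.open rather than by looking for an exact file name on disk.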
Several other important settings follow further down in the process-gnaf.py program. They have to be set by the user before the G-NAF pre-processing can be started.
The directory containing the G-NAF input files (the G-NAF input directory
gnaf_input_directory) as well as the directory where all created files will be stored (the G-NAF output directory
gnaf_output_directory) need to be defined.
.csv are used, respectively). Note that the file extensions defined must include the initial period (e.g.
'.slv' and not just 'slv').
The G-NAF files to be processed are defined in a dictionary (gnaf_files). Each G-NAF file definition consists of two parts: (1) the file name (without a file extension), and (2) a list of the field (or attribute) names as given in the header line of the G-NAF file. It is assumed that the first line in all G-NAF files is the header line, and this first line will thus not be used for processing. For example, the file definition for the G-NAF file
'G_STATE.csv' containing state information will be:
gnaf_files['g_state'] = ['G_STATE', ["STATE_ABBREVIATION", "STATE_NAME"]]
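As a rough illustration of how such a file definition could be used, the sketch below builds the file path from a directory, the file name and an extension, checks that the extension starts with a period, and compares the header line against the expected field names before returning the data rows. The helper read_gnaf_file, the temporary directory and the sample rows are assumptions for this example; the actual routines in gnaffunctions.py may work differently.

```python
import csv
import os
import tempfile

gnaf_files = {}
gnaf_files['g_state'] = ['G_STATE', ["STATE_ABBREVIATION", "STATE_NAME"]]


def read_gnaf_file(input_directory, file_def, extension='.csv'):
    """Return the data rows of one G-NAF file as a list of dictionaries.

    Hypothetical helper: the header line is checked against the field
    names in 'file_def' and is not returned as data.
    """
    if not extension.startswith('.'):
        raise ValueError('File extension must start with a period: %r'
                         % extension)
    file_name, field_names = file_def
    path = os.path.join(input_directory, file_name + extension)
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)  # First line is assumed to be the header
        if header != field_names:
            raise ValueError('Unexpected header in %s: %s' % (path, header))
        return [dict(zip(field_names, row)) for row in reader]


# Write a small sample file so that the sketch can be run on its own
gnaf_input_directory = tempfile.mkdtemp()
with open(os.path.join(gnaf_input_directory, 'G_STATE.csv'), 'w',
          newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["STATE_ABBREVIATION", "STATE_NAME"])
    writer.writerow(["ACT", "AUSTRALIAN CAPITAL TERRITORY"])
    writer.writerow(["NSW", "NEW SOUTH WALES"])

state_rows = read_gnaf_file(gnaf_input_directory, gnaf_files['g_state'])
```

Checking the header against the definition is one way to catch a mismatch between the field list and the actual file early, before any index is built from misaligned columns.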
Set austpost_lookup_file_name to an Australia Post postcode/suburb look-up file (which can be downloaded from http://www.auspost.com.au/postcodes/), and optionally set a state filter
(use None if all Australian postcodes/suburbs should be processed into a look-up table for use while pre-processing the G-NAF locality files). Possible state values are:
'aat' for the Australian Antarctic Territory
'act' for the Australian Capital Territory
'nsw' for New South Wales
'nt' for the Northern Territory
'sa' for South Australia
'wa' for Western Australia
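The effect of the state filter can be sketched as follows. The column names (Pcode, Locality, State), the sample rows and the function load_postcode_lookup are assumptions for illustration only; the layout of the real Australia Post file and the corresponding code in process-gnaf.py may differ.

```python
import csv
import io


def load_postcode_lookup(csv_file, state_filter=None):
    """Build look-up tables between postcodes and suburbs.

    If 'state_filter' is None all rows are used, otherwise only rows
    whose state matches the filter (compared case-insensitively).
    """
    postcode_to_suburbs = {}
    suburb_to_postcodes = {}
    for row in csv.DictReader(csv_file):
        state = row['State'].strip().lower()
        if state_filter is not None and state != state_filter.lower():
            continue  # Row belongs to a state that was filtered out
        postcode = row['Pcode'].strip()
        suburb = row['Locality'].strip().lower()
        postcode_to_suburbs.setdefault(postcode, set()).add(suburb)
        suburb_to_postcodes.setdefault(suburb, set()).add(postcode)
    return postcode_to_suburbs, suburb_to_postcodes


# Made-up sample data in a hypothetical column layout
sample = io.StringIO(
    'Pcode,Locality,State\n'
    '2602,DICKSON,ACT\n'
    '2602,LYNEHAM,ACT\n'
    '2000,SYDNEY,NSW\n')

act_postcodes, act_suburbs = load_postcode_lookup(sample, 'act')
sample.seek(0)  # Rewind so the same data can be read without a filter
all_postcodes, all_suburbs = load_postcode_lookup(sample, None)
```

With the filter set to 'act' only the two ACT rows end up in the look-up tables, while passing None keeps every row.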
Once all the settings in process-gnaf.py are adjusted according to a user's needs, the G-NAF pre-processing can be started from the command line with: