To use the G-NAF data set efficiently for geocoding, the necessary G-NAF files need to be cleaned, pre-processed and indexed so that matching records (and their longitude and latitude) can be retrieved quickly.
The program process-gnaf.py (available in the geocode directory) does exactly this pre-processing of the G-NAF files, which are assumed to be available as CSV (comma separated values) text files. The main computation routines used in process-gnaf.py are implemented in the module gnaffunctions.py, which is also available in the geocode directory.
As discussed in Section 10.1, G-NAF consists of many files containing the normalised address, street and locality data, geocoding information (for address sites, streets and localities), as well as various alias information.
All settings for process-gnaf.py need to be specified by the user within the program itself. The main processing flags (or switches) that control what kind of pre-processing is performed are discussed first, followed by all other process-gnaf.py settings.
At the beginning of process-gnaf.py (after the license header, module description and module imports) is a code section which contains a number of flags (or switches) which can be set to True or False in order to enable or disable the processing of parts of the G-NAF files. The following flags can be set.
check_pid_uniqueness: If set to True, all PIDs (persistent identifiers) in the G-NAF files G_ADDRESS_DETAIL, G_ADDRESS_SITE, G_STREET and G_LOCALITY are checked for uniqueness, and errors are reported.
save_pickle_files: Set to True if the inverted indexes are to be saved into Python pickle files. See the Python documentation (module index) for information about the pickle module.
save_shelve_files: Set to True if the inverted indexes are to be saved into Python shelve files. See the Python documentation (module index) for information about the shelve module.
save_text_files: Set to True if the inverted indexes are to be saved into text files. This is mainly for debugging purposes, so that the inverted indexes can be viewed easily. These text files cannot be used by the geocoding process.
process_coll_dist_files: Set to True (and define the corresponding file details as described below) if collection district files are to be processed (and saved into pickle, shelve and/or text files according to the save_..._files flags given above).
process_locality_files: Set to True if the inverted indexes for locality related attributes (using locality related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the save_..._files flags given above).
process_street_files: Set to True if the inverted indexes for street related attributes (using street related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the save_..._files flags given above).
process_address_files: Set to True if the inverted indexes for address site related attributes (using address related G-NAF files) are to be saved into pickle, shelve and/or text files (according to the save_..._files flags given above).
create_reverse_lookup_shelve: If set to True, a shelve file (a look-up dictionary) will be created which contains G-NAF persistent identifiers (PIDs) as keys and, as values, the corresponding records taken from the following five G-NAF files: G_LOCALITY, G_LOCALITY_ALIAS, G_STREET, G_STREET_LOCALITY_ALIAS and G_ADDRESS_DETAIL.
create_gnaf_address_csv_file: If set to True, a comma separated values (CSV) text file will be created (with the file name set using gnaf_address_csv_file_name) which will contain all the addresses in G-NAF including longitude and latitude, compiled using the following G-NAF files: G_LOCALITY, G_STREET, G_ADDRESS_DETAIL and G_ADDRESS_SITE_GEOCODE.
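The kind of check enabled by check_pid_uniqueness can be illustrated with a small sketch. This is not the code from gnaffunctions.py: the function name find_duplicate_pids, the column name LOCALITY_PID and the sample rows are assumptions made purely for illustration, and the sketch is written for Python 3, which may differ from the Python version process-gnaf.py targets.

```python
import csv
import io
from collections import Counter


def find_duplicate_pids(csv_file, pid_field):
    """Return a sorted list of PID values that occur more than once.

    csv_file is any open text file whose first line is a header;
    pid_field is the name of the column holding the persistent
    identifier (hypothetical here, real G-NAF column names may differ).
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row[pid_field] for row in reader)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Tiny in-memory stand-in for a G-NAF file (made-up data)
sample = io.StringIO(
    "LOCALITY_PID,LOCALITY_NAME\n"
    "1001,DICKSON\n"
    "1002,LYNEHAM\n"
    "1001,DICKSON\n"
)

duplicates = find_duplicate_pids(sample, "LOCALITY_PID")
print(duplicates)  # prints ['1001']
```

Counting every PID in one pass keeps the check linear in the number of records, which matters for files as large as the address tables.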
# ====================================================================
# Some flags that control the G-NAF pre-processing, set to either True
# or False

check_pid_uniqueness = False

save_pickle_files = True  # Save inverted indexes into binary Python
                          # pickles
save_shelve_files = True  # Save inverted indexes into binary Python
                          # shelves
save_text_files = True    # Save inverted indexes into text files

process_coll_dist_files = True  # Process collection district files
process_locality_files = True   # Process the G-NAF locality related
                                # files
process_street_files = True     # Process the G-NAF street related
                                # files
process_address_files = True    # Process the G-NAF address related
                                # files

create_reverse_lookup_shelve = True  # Create one large shelve to be
                                     # used for reverse look-ups (i.e.
                                     # given one or more PID find the
                                     # corresponding G-NAF records)

create_gnaf_address_csv_file = True  # Create one large CSV file with
                                     # all G-NAF addresses (values
                                     # merged from several files) and
                                     # their locations
gnaf_address_csv_file_name = 'gnav_address_geocodes.csv'  # Corresponding file name
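How the three save_..._files flags could steer the output of an inverted index is sketched below. It is only an illustration under stated assumptions, not the routine used by process-gnaf.py: the functions build_inverted_index and save_index, the '.pik' extension and the sample street records are hypothetical, and the code is written for Python 3.

```python
import os
import pickle
import shelve
import tempfile

# Flags named as in the settings above
save_pickle_files = True
save_shelve_files = True
save_text_files = True


def build_inverted_index(records, field):
    """Map each value of 'field' to the sorted list of PIDs having it."""
    index = {}
    for pid, record in records.items():
        index.setdefault(record[field], []).append(pid)
    for pids in index.values():
        pids.sort()
    return index


def save_index(index, base_name):
    """Save the index in every format whose flag is True.

    Returns the list of formats that were written.
    """
    written = []
    if save_pickle_files:
        with open(base_name + ".pik", "wb") as f:   # '.pik' is an assumption
            pickle.dump(index, f)
        written.append("pickle")
    if save_shelve_files:
        with shelve.open(base_name + ".slv") as db:
            db.update(index)                         # shelve keys must be strings
        written.append("shelve")
    if save_text_files:
        with open(base_name + ".txt", "w") as f:     # human-readable, debugging only
            for key in sorted(index):
                f.write("%s: %s\n" % (key, ", ".join(index[key])))
        written.append("text")
    return written


# Made-up street records keyed by PID
records = {
    "S1": {"street_name": "MAIN"},
    "S2": {"street_name": "HIGH"},
    "S3": {"street_name": "MAIN"},
}
index = build_inverted_index(records, "street_name")

base = os.path.join(tempfile.mkdtemp(), "street_name")
formats = save_index(index, base)

# Read both binary formats back to confirm they hold the same index
with open(base + ".pik", "rb") as f:
    loaded = pickle.load(f)
with shelve.open(base + ".slv") as db:
    shelved = dict(db)
```

A pickle is loaded into memory as a whole, while a shelve is read key by key from disk, so the choice between the two is a trade-off between look-up speed and memory use.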
Several other important settings follow further down in the process-gnaf.py program. They have to be set by the user before the G-NAF pre-processing can be started.
The directory containing the G-NAF files (the G-NAF input directory gnaf_input_directory) as well as the directory where all created files will be stored (the G-NAF output directory gnaf_output_directory) need to be defined.
The file extensions of the created files also need to be defined (for text and CSV files, .txt and .csv are used, respectively). Note that the file extensions defined must include the initial period (e.g. '.slv' and not just 'slv').
The G-NAF files to be processed need to be defined in a dictionary (gnaf_files). Each G-NAF file definition consists of two parts: (1) the file name (without a file extension), and (2) a list of the field (or attribute) names as given in the header line of the G-NAF file. It is assumed that the first line in every G-NAF file is the header line, so this first line will not be used for processing. For example, the file definition for the G-NAF file 'G_STATE.csv' containing state information will be:
gnaf_files['g_state'] = ['G_STATE', ["STATE_ABBREVIATION", "STATE_NAME"]]
austpost_lookup_file_name needs to be set to the name of an Australia Post postcode/suburb look-up file (which can be downloaded from http://www.auspost.com.au/postcodes/), together with an optional state filter austpost_lookup_state_filter (set to None if all Australian postcodes/suburbs should be processed into a look-up table for use while pre-processing the G-NAF locality files). Possible state values are:
'aat' for the Australian Antarctic Territory
'act' for the Australian Capital Territory
'nsw' for New South Wales
'nt' for the Northern Territory
'qld' for Queensland
'sa' for South Australia
'tas' for Tasmania
'vic' for Victoria
'wa' for Western Australia
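The file definitions and the file extension rule described above can be illustrated with a short sketch. It assumes Python 3 and is not taken from process-gnaf.py: the setting name gnaf_csv_file_ext, the helper functions check_extension and read_gnaf_file, and the sample rows are hypothetical. Only the gnaf_files['g_state'] definition follows the example given above.

```python
import csv
import os
import tempfile

gnaf_input_directory = tempfile.mkdtemp()  # stand-in for the real input directory
gnaf_csv_file_ext = ".csv"                 # hypothetical setting name

gnaf_files = {}
gnaf_files["g_state"] = ["G_STATE", ["STATE_ABBREVIATION", "STATE_NAME"]]


def check_extension(ext):
    """Return ext unchanged, or raise if the initial period is missing."""
    if not ext.startswith("."):
        raise ValueError("File extension must start with a period: %r" % ext)
    return ext


def read_gnaf_file(key):
    """Read one defined G-NAF file and return its data rows as dictionaries."""
    file_name, field_names = gnaf_files[key]
    path = os.path.join(gnaf_input_directory,
                        file_name + check_extension(gnaf_csv_file_ext))
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # the first line is the header, not data
        if header != field_names:
            raise ValueError("Unexpected header in %s: %s" % (path, header))
        return [dict(zip(field_names, row)) for row in reader]


# Write a small made-up G_STATE.csv so that the sketch can be run as is
sample_path = os.path.join(gnaf_input_directory, "G_STATE.csv")
with open(sample_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["STATE_ABBREVIATION", "STATE_NAME"])
    writer.writerow(["ACT", "AUSTRALIAN CAPITAL TERRITORY"])
    writer.writerow(["NSW", "NEW SOUTH WALES"])

states = read_gnaf_file("g_state")
```

Comparing the header line with the field names in the definition catches a mismatched or reordered file before any records are indexed.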
Once all the settings in process-gnaf.py have been adjusted according to the user's needs, the G-NAF pre-processing can be started from the command line with:
python process-gnaf.py