Processing the G-NAF data files consists of two steps, the first being
the cleaning and standardisation as described above, and the second
being the building of inverted indexes. Such an inverted index is a
keyed hash-table in which the keys are the values from the cleaned
G-NAF data files, and the entries in the hash-table are sets with the
corresponding PIDs (persistent identifiers) of the values. For
example, assume there are four records in the LOCALITY file with the
following content (the first line is a header line with the attribute
names) [10.1]:
locality_pid, locality_name, state_abbrev, postcode
60310919, sydney, nsw, 2000
60709845, north_sydney, nsw, 2059
60309156, north_sydney, nsw, 2060
61560124, the_rocks, nsw, 2000
The inverted indexes for the attributes locality_name, state_abbrev
and postcode then are (square brackets denote lists and round brackets
denote sets):
locality_name_index = ['north_sydney':(60709845,60309156), 'sydney':(60310919), 'the_rocks':(61560124)]
state_abbrev_index = ['nsw':(60310919,60709845,60309156,61560124)]
postcode_index = ['2000':(60310919,61560124), '2059':(60709845), '2060':(60309156)]
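The construction of such indexes can be illustrated with a minimal Python sketch. It is not the code used by Febrl; the function name build_inverted_indexes and the in-memory CSV string are assumptions made for illustration only, and the PIDs are the made-up ones from the example above.

```python
import csv
import io
from collections import defaultdict

# Example LOCALITY records from the text (these are not real G-NAF PIDs).
LOCALITY_CSV = """locality_pid, locality_name, state_abbrev, postcode
60310919, sydney, nsw, 2000
60709845, north_sydney, nsw, 2059
60309156, north_sydney, nsw, 2060
61560124, the_rocks, nsw, 2000
"""

def build_inverted_indexes(csv_text, pid_field, indexed_fields):
    """Hypothetical helper: return {field: {value: set of PIDs}}."""
    indexes = {field: defaultdict(set) for field in indexed_fields}
    # skipinitialspace drops the blank that follows each comma,
    # in the header line as well as in the data lines.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for record in reader:
        pid = record[pid_field].strip()
        for field in indexed_fields:
            # Key is the cleaned attribute value, entry is a set of PIDs.
            indexes[field][record[field].strip()].add(pid)
    return indexes

indexes = build_inverted_indexes(
    LOCALITY_CSV, "locality_pid",
    ["locality_name", "state_abbrev", "postcode"])
print(sorted(indexes["locality_name"]["north_sydney"]))
```

Each index is a dictionary keyed by attribute value, and each entry is a set, so a PID that occurs repeatedly for the same value is stored only once.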
A lookup of the postcode value '2000' would result in the set of PIDs
(60310919,61560124), which, when intersected with the PIDs for the
locality name value 'the_rocks', would result in the single-PID set
(61560124), which corresponds to the original record. The location of
this PID can then be looked up in the corresponding G-NAF geocode
index. Table 10.2 shows the attributes for which an inverted index is
built.
The Febrl program process-gnaf.py as described in Section 10.3 can be used for doing this pre-processing.
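The lookup-and-intersect step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than an excerpt from process-gnaf.py: the helper names lookup and match_pids are hypothetical, and the indexes are written out as plain dictionaries holding the example PIDs.

```python
# Inverted indexes from the example, as dicts mapping values to PID sets.
locality_name_index = {
    "north_sydney": {60709845, 60309156},
    "sydney": {60310919},
    "the_rocks": {61560124},
}
postcode_index = {
    "2000": {60310919, 61560124},
    "2059": {60709845},
    "2060": {60309156},
}

def lookup(index, value):
    """Hypothetical helper: PID set for a value, empty set if not indexed."""
    return index.get(value, set())

def match_pids(*pid_sets):
    """Hypothetical helper: intersect the PID sets of several lookups."""
    if not pid_sets:
        return set()
    return set.intersection(*pid_sets)

postcode_pids = lookup(postcode_index, "2000")            # two candidates
locality_pids = lookup(locality_name_index, "the_rocks")  # one candidate
matched = match_pids(postcode_pids, locality_pids)
print(matched)  # {61560124}
```

An empty result means that no single record carries all of the queried values. For instance, intersecting the PIDs for postcode '2059' with those for locality name 'sydney' yields the empty set.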
[10.1] The locality_pids given here are not real G-NAF PIDs.