10.2.1 Processing the G-NAF Files

Processing the G-NAF data files consists of two steps: first, the cleaning and standardisation described above; second, the building of inverted indexes. An inverted index is a keyed hash-table in which the keys are the values from the cleaned G-NAF data files, and each entry is the set of PIDs (persistent identifiers) of the records that contain that value. For example, assume there are four records in the LOCALITY file with the following content (the first line is a header line with the attribute names; see footnote 10.1):

        locality_pid,  locality_name,  state_abbrev,  postcode
        60310919,      sydney,         nsw,           2000
        60709845,      north_sydney,   nsw,           2059
        60309156,      north_sydney,   nsw,           2060
        61560124,      the_rocks,      nsw,           2000

The inverted indexes for the three attributes locality_name, state_abbrev and postcode are then as follows (square brackets denote lists and round brackets denote sets):

        locality_name_index = ['north_sydney':(60709845,60309156),
                               'sydney':(60310919),
                               'the_rocks':(61560124)]

        state_abbrev_index = ['nsw':(60310919,60709845,60309156,61560124)]

        postcode_index = ['2000':(60310919,61560124),
                          '2059':(60709845),
                          '2060':(60309156)]

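
The construction of such indexes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by process-gnaf.py: the function name build_inverted_indexes and the in-memory CSV string are hypothetical, and the input is assumed to be already cleaned and standardised.

```python
import csv
import io
from collections import defaultdict

# Sample LOCALITY records from the text (these are not real G-NAF PIDs).
LOCALITY_CSV = """locality_pid,locality_name,state_abbrev,postcode
60310919,sydney,nsw,2000
60709845,north_sydney,nsw,2059
60309156,north_sydney,nsw,2060
61560124,the_rocks,nsw,2000
"""


def build_inverted_indexes(csv_text, pid_field, value_fields):
    """Hypothetical helper: return {field: {value: set of PIDs}}."""
    indexes = {field: defaultdict(set) for field in value_fields}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pid = row[pid_field].strip()
        for field in value_fields:
            # Each attribute value maps to the set of PIDs that carry it.
            indexes[field][row[field].strip()].add(pid)
    return indexes


indexes = build_inverted_indexes(
    LOCALITY_CSV,
    "locality_pid",
    ["locality_name", "state_abbrev", "postcode"],
)
```

Sets are used for the entries so that the later matching step can intersect them directly.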
The matching engine then has to find the intersection of the inverted index sets for the values in a given record. For example, a postcode value '2000' yields the PID set (60310919,61560124). When this set is intersected with the PIDs for the locality name value 'the_rocks', the result is the single-PID set (61560124), which corresponds to the original record. The location of this PID can then be looked up in the corresponding G-NAF geocode index. Table 10.2 shows the attributes for which an inverted index is built.
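
The intersection step described above can be sketched as follows. This is a minimal illustration that assumes exact value matches only; the function name match_pids is hypothetical and does not describe how the Febrl matching engine itself is implemented.

```python
# Inverted indexes for the sample LOCALITY records (not real G-NAF PIDs).
locality_name_index = {
    "north_sydney": {"60709845", "60309156"},
    "sydney": {"60310919"},
    "the_rocks": {"61560124"},
}
postcode_index = {
    "2000": {"60310919", "61560124"},
    "2059": {"60709845"},
    "2060": {"60309156"},
}


def match_pids(record, indexes):
    """Hypothetical helper: intersect the PID sets of all values in a record.

    A value that is missing from its index gives an empty result.
    """
    result = None
    for field, value in record.items():
        pids = indexes[field].get(value, set())
        result = set(pids) if result is None else result & pids
        if not result:
            break  # No PID can match any more, so stop early.
    return result or set()


indexes = {"locality_name": locality_name_index, "postcode": postcode_index}
matched = match_pids({"postcode": "2000", "locality_name": "the_rocks"}, indexes)
print(matched)  # {'61560124'}
```
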

Table 10.2: G-NAF attributes used for geocode matching.
(Table body not recoverable; the surviving fragments are the column heading "G-NAF data file" and the attribute list ending "...ame, street_type,".)

The Febrl program process-gnaf.py, described in Section 10.3, can be used for this pre-processing.


Footnote 10.1: Note that the locality_pids given here are not real G-NAF PIDs.