An overview of the Febrl geocoding system is shown in Figure 10.3. The geocoding process can be split into the pre-processing of the G-NAF data files (which is described in more detail in Section 10.2.1), and the fuzzy matching with user-supplied addresses as presented in Section 10.2.3.
The pre-processing step takes the G-NAF data files and uses the Febrl address cleaning and standardisation routines to convert the detailed address values (such as street names, types and suffixes, house numbers and suffixes, flat types and numbers, locality names and postcodes) into a form that is consistent with the user data after Febrl standardisation. Note that the G-NAF data files already come in a highly standardised form, but the finer details, for example how whitespace within locality names is treated, make the difference between successful and failed matching. The cleaned and standardised reference records are then inserted into a number of inverted index data structures. See Chapters 6, 7 and 8 for details about the Febrl standardisation process.
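To illustrate the idea of inserting standardised reference values into an inverted index, the following is a minimal sketch and not the actual Febrl implementation. The function names `standardise` and `build_inverted_index`, the record layout, the field names and the underscore convention used here are all assumptions made for the example.

```python
from collections import defaultdict


def standardise(value):
    """Lower-case the value and join its words with underscores.

    This convention is an assumption for the sketch; the point is only
    that reference data and user data must use the same convention.
    """
    return "_".join(value.lower().split())


def build_inverted_index(records, field):
    """Map each standardised value of `field` to the set of record ids holding it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        raw_value = record.get(field)
        if raw_value:
            index[standardise(raw_value)].add(record_id)
    return index


# Hypothetical reference records (not real G-NAF data).
reference_records = {
    "r1": {"street_name": "MILLER", "locality_name": "SOUTH  BANK"},
    "r2": {"street_name": "Miller", "locality_name": "south bank"},
    "r3": {"street_name": "Smith", "locality_name": "Dickson"},
}

locality_index = build_inverted_index(reference_records, "locality_name")
street_index = build_inverted_index(reference_records, "street_name")
```

In this sketch both spellings of the locality end up under the single key `south_bank`, so a user address standardised in the same way finds both records with one look-up.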
Additional data used in the pre-processing step are a publicly available postcode-suburb look-up table, which can be used to impute missing postcode or suburb values in the G-NAF locality files, and a table extracted from a commercial GIS system containing postcode and suburb boundary information, which is used to create neighbouring region look-up tables. Section 10.2.2 discusses the use of these files in more detail.
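The two kinds of look-up tables can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout, the rule of imputing only when the look-up is unambiguous, and the reduction of boundary information to pairs of adjacent regions are all hypothetical, and the postcode and suburb values are placeholders.

```python
def impute_locality(record, postcode_to_suburbs, suburb_to_postcodes):
    """Fill in a missing postcode or suburb when exactly one candidate exists.

    Imputing only unambiguous values is an assumption of this sketch.
    """
    filled = dict(record)
    if not filled.get("postcode") and filled.get("suburb"):
        candidates = suburb_to_postcodes.get(filled["suburb"], set())
        if len(candidates) == 1:
            filled["postcode"] = next(iter(candidates))
    elif not filled.get("suburb") and filled.get("postcode"):
        candidates = postcode_to_suburbs.get(filled["postcode"], set())
        if len(candidates) == 1:
            filled["suburb"] = next(iter(candidates))
    return filled


def build_neighbour_table(adjacent_pairs):
    """Turn pairs of adjacent regions into a symmetric neighbour look-up table."""
    neighbours = {}
    for region_a, region_b in adjacent_pairs:
        neighbours.setdefault(region_a, set()).add(region_b)
        neighbours.setdefault(region_b, set()).add(region_a)
    return neighbours


# Placeholder look-up data (not taken from any real table).
suburb_to_postcodes = {"suburb_a": {"0001"}, "suburb_b": {"0002", "0003"}}
postcode_to_suburbs = {"0001": {"suburb_a"}, "0002": {"suburb_b"}, "0003": {"suburb_b"}}

imputed = impute_locality({"suburb": "suburb_a", "postcode": ""},
                          postcode_to_suburbs, suburb_to_postcodes)
ambiguous = impute_locality({"suburb": "suburb_b", "postcode": ""},
                            postcode_to_suburbs, suburb_to_postcodes)
neighbour_table = build_neighbour_table([("0001", "0002"), ("0002", "0003")])
```

Here `suburb_a` receives its single known postcode, while `suburb_b` is left unchanged because two postcodes are possible.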
The geocode matching engine takes as input the inverted indexes and the raw user data, which is cleaned and standardised before geocoding is attempted. As shown in Figure 10.3, the user data can either be loaded from a data file, geocoded and then stored back into a data file, or it can be passed as a single address to the geocoding system and returned via a Web interface.
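The two ways of supplying user data described above can be sketched as a thin wrapper around a geocoding function. This is a hypothetical illustration only: `geocode_batch`, `lookup_geocoder`, the CSV layout and the column names are assumptions, the coordinates are placeholders, and the dictionary look-up merely stands in for the matching engine.

```python
import csv
import io


def geocode_batch(in_file, out_file, geocode, address_field="address"):
    """Read addresses from a CSV file, geocode each one, and write the rows back out."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + ["latitude", "longitude"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["latitude"], row["longitude"] = geocode(row[address_field])
        writer.writerow(row)


# Placeholder coordinates; a real system would consult the inverted indexes.
PLACEHOLDER_LOCATIONS = {"addr a": (1.0, 2.0)}


def lookup_geocoder(address):
    """Stand-in for the matching engine: return coordinates, or empty strings if unknown."""
    return PLACEHOLDER_LOCATIONS.get(address.strip().lower(), ("", ""))


# Batch mode: file in, file out (in-memory files are used here for brevity).
batch_input = io.StringIO("id,address\n1,Addr A\n2,Addr B\n")
batch_output = io.StringIO()
geocode_batch(batch_input, batch_output, lookup_geocoder)

# Single-address mode, as a Web interface might call it.
single_result = lookup_geocoder("Addr A")
```

Passing the geocoding function as a parameter keeps the file handling independent of how matching is done, so the same function can serve both modes.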
Because the G-NAF data files are already available in a high-quality standardised form, only the first two steps of the Febrl address cleaning and standardisation approach (cleaning and tagging) are needed for G-NAF attributes that contain strings, for example street and locality names. The aim of this standardisation is to make sure the finer details (such as the use of underscores in locality names, as in the whitespace example above) are the same in both the standardised G-NAF data and the user data. No cleaning is needed at all for attributes containing numbers only, like street and flat numbers, or postcodes.
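A minimal sketch of applying cleaning and tagging to string attributes only, while passing numeric attributes through unchanged, might look as follows. The function names, the field names, the tag labels and the tagging look-up table are hypothetical and do not reproduce the Febrl routines.

```python
def clean(value):
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def tag(value, tag_lookup):
    """Return (standardised value, tag); values not in the look-up get the tag 'UN'."""
    if value in tag_lookup:
        return tag_lookup[value]
    return value.replace(" ", "_"), "UN"


def preprocess_reference_record(record, string_fields, tag_lookup):
    """Clean and tag the string attributes; copy all other attributes unchanged."""
    processed = {}
    for field, value in record.items():
        if field in string_fields:
            processed[field], _tag = tag(clean(value), tag_lookup)
        else:
            processed[field] = value  # numeric attributes need no cleaning
    return processed


# Hypothetical tagging look-up table and reference record.
tag_lookup = {"south bank": ("south_bank", "LN")}
string_fields = {"street_name", "locality_name"}
raw_record = {
    "street_name": "  MILLER ",
    "locality_name": "South   Bank",
    "number_first": "42",
    "postcode": "0001",
}

processed_record = preprocess_reference_record(raw_record, string_fields, tag_lookup)
```

Running the user data through the same `clean` and `tag` functions is what keeps details such as the underscore in `south_bank` identical on both sides of the match.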