10. Geocoding

Geocoding is the task of creating geographical coordinates (longitude and latitude) from address information, and is a first step in many spatial data analysis projects. The task involves matching (or linking) a geocoded reference data set with a user's data set that contains the addresses to be geocoded. It therefore raises the same issues as general record linkage: addresses in a user data set might be formatted differently, might be incomplete, might not exist in the reference data set, or might contain street names or postcodes that have since changed.

The US Federal Geographic Data Committee estimates that geographic location is a key feature in $80\%$ to $90\%$ of governmental data collections [34]. In many cases, addresses are the key to spatially enabling such data. Once a location has been generated from the street address information in the data source, the data can be processed further, used in spatial data mining [12] projects, and visualised and combined with other data using Geographical Information Systems (GIS).

The applications of spatial data analysis and mining are widespread. In the health sector, for example, geocoded data can be used to find local clusters of disease. Environmental health studies often rely on GIS and geocoding software to map areas of potential exposure and to locate where people live in relation to these areas. Geocoded data can also help in the planning of new health resources, e.g. additional health care providers can be allocated close to where there is an increased need for services. An overview of geographical health issues is given in [5]. When combined with a street navigation system, accurate geocoded data can help emergency services find the location of a reported emergency (for example, if a caller reports an incomplete or unclear address).

Geocoded customer data, combined with additional demographic data, can help businesses to better plan marketing and future expansion, and the analysis of historical geocoded data can, for example, show changes in their customer base. In census work, geocoding can be used to assign people or households to small area units, for example census collection districts, which then form the basis of further statistical analysis.

There are two basic scenarios for geocoding user data. In the first, a user wants to automatically geocode a data set. The geocoding system should find the best possible match for each record in the user data set without human intervention. Each record needs to be attributed with the corresponding location plus a match status which indicates the accuracy of the match obtained (for example an exact address match, a street level match, or a locality level match). This scenario becomes problematic if the user data is not of high quality and contains records with missing, incorrect or out-of-date address information. Typographical errors are common in addresses, especially when they are recorded over the telephone or from hand-written forms. As reported in [27], a match rate of $70\%$ (the proportion of records successfully geocoded) is often considered an acceptable result.

In the second scenario, a user wants to geocode a single address that may be incomplete, erroneous or unformatted. The system should return the location if an exact match can be found, or alternatively a list of possible matches, together with a matching status and a likelihood rating. This geocoding of a single record should be done in (near) real time (i.e. with a response time of less than a couple of seconds) and be available via a suitable user interface (e.g. a Web site).

Figure 10.1: Example geocoding using property parcel centres (numbers 1 to 7) and street reference file centreline (dashed line and numbers 8 to 13, with the dotted lines corresponding to a global street offset).

Because a full address match cannot always be found, traditional geocoding systems often work in a hierarchical fashion. First, a match on the full address (i.e. street name, type and number, and if available unit numbers, building names, etc.) is attempted, which ideally results in an exact address level one-to-one match between a user record and a reference record, and thus in an exact location for the user record. If the full address level matching fails, the next coarser matching is on street type and name only, resulting in a street level match (in which case a street centroid and/or bounding box will be assigned to the user record). If even the street name doesn't match, the next coarser area will normally be the locality (suburb or postcode), resulting in a locality level match, with a centroid location and/or bounding box being assigned to the user record.
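The hierarchical fallback described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not code from any particular geocoding system: the lookup tables, their keys, the coordinates, the status strings and the function name geocode_hierarchical are all invented for this example. The sketch also assumes that streets are identified within a locality.

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)

# Toy reference tables; every key and coordinate here is invented.
ADDRESSES: Dict[Tuple[str, str, str, str], Coord] = {
    ("12", "miller", "street", "dickson"): (149.1400, -35.2500),
    ("14", "miller", "street", "dickson"): (149.1404, -35.2502),
}
STREETS: Dict[Tuple[str, str, str], Coord] = {
    ("miller", "street", "dickson"): (149.1402, -35.2501),  # street centroid
}
LOCALITIES: Dict[str, Coord] = {
    "dickson": (149.1390, -35.2510),  # locality centroid
}


def geocode_hierarchical(number: str, name: str, street_type: str,
                         locality: str) -> Tuple[str, Optional[Coord]]:
    """Try address, then street, then locality; return (status, location)."""
    # Level 1: full address, giving an exact location.
    location = ADDRESSES.get((number, name, street_type, locality))
    if location is not None:
        return "address level match", location
    # Level 2: street name and type only, giving a street centroid.
    location = STREETS.get((name, street_type, locality))
    if location is not None:
        return "street level match", location
    # Level 3: locality only, giving a locality centroid.
    location = LOCALITIES.get(locality)
    if location is not None:
        return "locality level match", location
    return "no match", None


print(geocode_hierarchical("12", "miller", "street", "dickson"))
print(geocode_hierarchical("99", "miller", "street", "dickson"))
print(geocode_hierarchical("5", "smith", "road", "dickson"))
```

The three calls fall through to the address, street and locality level respectively. A real system would also return a bounding box at the coarser levels; this is omitted here to keep the sketch short.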

The geocode matching implemented in Febrl works in a slightly different way: if no exact address level match can be found in the given region, neighbouring regions (i.e. suburbs and postcodes) are searched first, to see if an exact address level match can be found there. The Febrl matching also implements fuzzy matching, and if no exact match can be found, a list of possible matches (each with a matching score) will be returned.
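The following minimal sketch illustrates the idea of searching neighbouring regions before falling back to fuzzy matching. It is not the Febrl implementation: the reference data, the neighbour table, the function name geocode_with_neighbours and the min_score threshold are hypothetical, and the similarity measure (SequenceMatcher from the Python standard library module difflib) is merely a stand-in for whatever approximate string comparison the matching engine actually uses.

```python
from difflib import SequenceMatcher

# Toy reference data: locality -> {standardised address: (longitude, latitude)}.
# All names and coordinates are invented for illustration.
REFERENCE = {
    "dickson": {"12 miller street": (149.1400, -35.2500)},
    "lyneham": {"7 archer street": (149.1300, -35.2400)},
    "downer": {"3 bradfield street": (149.1450, -35.2450)},
}
# Hypothetical table of which localities border each other.
NEIGHBOURS = {"dickson": ["lyneham", "downer"]}


def geocode_with_neighbours(address, locality, min_score=0.8):
    """Return a list of (status, region, address, location, score) tuples."""
    regions = [locality] + NEIGHBOURS.get(locality, [])

    # Pass 1: exact match, first in the given locality, then in neighbours.
    for region in regions:
        location = REFERENCE.get(region, {}).get(address)
        if location is not None:
            if region == locality:
                status = "exact address match"
            else:
                status = "exact address match in neighbouring region"
            return [(status, region, address, location, 1.0)]

    # Pass 2: no exact match anywhere, so collect scored fuzzy candidates.
    candidates = []
    for region in regions:
        for ref_address, location in REFERENCE.get(region, {}).items():
            score = SequenceMatcher(None, address, ref_address).ratio()
            if score >= min_score:
                candidates.append(
                    ("fuzzy match", region, ref_address, location, score))
    # Best-scoring candidate first.
    candidates.sort(key=lambda candidate: candidate[4], reverse=True)
    return candidates


print(geocode_with_neighbours("7 archer street", "dickson"))
print(geocode_with_neighbours("12 miler street", "dickson"))
```

The first call finds an exact match in a neighbouring locality even though the user record names the wrong suburb. The second contains a typographical error, so only a scored fuzzy candidate is returned.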

Many GIS software packages provide for street level geocoding. As a recent study shows [6], substantial positional errors can exist between the locations of addresses geocoded using street reference files (containing geographic centreline coordinates, street numbers and names, and postcodes) and the corresponding true locations. The use of point property parcel coordinates (i.e. the centres or centroids of properties), derived from cadastral data, is expected to significantly reduce these positional errors. Figure 10.1 gives an illustrative example. Even small discrepancies in geocoding can result in addresses being assigned to, for example, different census collection districts, which can have huge implications for small area analysis. A comprehensive property based database is now available for Australia: the Geocoded National Address File (G-NAF) [27]. G-NAF is used as the geocode reference file within the Febrl geocoding process. G-NAF is presented in more detail in Section 10.1.

In order to be able to do geocode matching in Febrl, the geocoded reference data set first needs to be pre-processed (using the program process-gnaf.py as described in Section 10.3 below). This pre-processing includes the same data cleaning and standardisation routines as used for address standardisation (as described in Chapter 6). This means that the geocode reference records are in a similar form (i.e. the same standardised fields or attributes) as a standardised user data set, thereby increasing the chances of successful matching. The pre-processed reference data set is stored in efficient binary files as inverted indexes, i.e. each field (or attribute) is stored in a separate binary file, and the geocode matching engine works by (efficiently) computing the intersections of candidate matching record sets.
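To illustrate the inverted index idea, the sketch below builds one index per field, mapping each field value to the set of reference record identifiers that contain it, and then finds candidate matches by intersecting these sets. It uses in-memory Python dictionaries rather than binary files, and the field names, records and function names (build_inverted_indexes, candidate_ids) are assumptions made for this example only; the actual index layout produced by the pre-processing may differ.

```python
from collections import defaultdict

# Hypothetical standardised field names, invented for this sketch.
FIELDS = ("street_name", "street_type", "locality", "postcode")

# Toy pre-processed reference records, keyed by record identifier.
RECORDS = {
    1: {"street_name": "miller", "street_type": "street",
        "locality": "dickson", "postcode": "2602"},
    2: {"street_name": "miller", "street_type": "street",
        "locality": "oconnor", "postcode": "2602"},
    3: {"street_name": "archer", "street_type": "street",
        "locality": "dickson", "postcode": "2602"},
    4: {"street_name": "miller", "street_type": "road",
        "locality": "dickson", "postcode": "2602"},
}


def build_inverted_indexes(records, fields=FIELDS):
    """Build one inverted index per field: value -> set of record ids."""
    indexes = {field: defaultdict(set) for field in fields}
    for rec_id, record in records.items():
        for field in fields:
            value = record.get(field)
            if value:  # skip missing or empty values
                indexes[field][value].add(rec_id)
    return indexes


def candidate_ids(indexes, query):
    """Intersect the id sets of all field values given in the query."""
    id_sets = [indexes[field].get(value, set())
               for field, value in query.items()]
    if not id_sets:
        return set()
    # Starting from the smallest set keeps the intersection cheap.
    id_sets.sort(key=len)
    return set.intersection(*id_sets)


indexes = build_inverted_indexes(RECORDS)
print(candidate_ids(indexes, {"street_name": "miller",
                              "street_type": "street",
                              "locality": "dickson"}))
```

Each additional field in the query can only shrink the candidate set, which is why matching on standardised fields in both the reference and the user data is important: a field value that is formatted differently yields an empty intersection.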

The next section presents G-NAF in more detail, followed by an outline of the Febrl geocoding process in Section 10.2. In Section 10.3 the pre-processing program process-gnaf.py is discussed, and Section 10.4 describes in detail how a geocoding process can be set up (using a module based on project-geocode.py) and shows various examples of geocode matching. Some auxiliary programs are part of the geocoding facilities of Febrl, and they are presented in Section 10.5. Finally, a list of outstanding tasks and ideas for improvements is given in Section 10.5.3.

Note: G-NAF is an Australian database and thus very specific. To our knowledge it is a unique project. Therefore, the Febrl geocoding system cannot be easily ported to other countries. We will not be able to provide any support for developing an international version of the Febrl geocoding system. However, proficient Python programmers should be able to modify the necessary modules within a reasonable amount of time.