14. Look-up and Frequency Tables

The data cleaning and standardisation process needs correction lists and look-up tables containing word corrections and expansions. In the linkage process, frequency tables can be used to compute matching weight probabilities for various components of names and addresses^14.1. For geocoding, neighbouring region look-up tables are needed. These lists and tables are stored in text files and can be created and edited by the user. This chapter describes the five types of look-up tables used in the Febrl system and their corresponding file formats. Their access routines are implemented in the module lookup.py.

Correction list
These lists contain strings (characters or words) and their replacements. They are used in the initial cleaning step in the data standardisation process.
Tagging look-up table
Similar to correction lists, these tables contain strings and their replacements. Additionally, groups of table entries are assigned a tag, which is then used in the tagging step within the data standardisation process. Tagging look-up tables are mainly used for names and addresses.
Frequency look-up table
These tables contain words and integer numbers that correspond to the frequency of each word as listed in an external frequency file. Such frequency tables are used within the record linkage process to calculate frequency-dependent matching weights.
Geographic location look-up table
These tables contain words and their corresponding geographical location given as a numerical longitude and latitude pair. For example, postcode locations can be used to calculate the distance between two postcodes, which can then be used in a field comparison function to calculate a matching weight.
Neighbouring region look-up table
These tables contain region values, for example postcodes or suburb names, and for each of them a list of its neighbouring regions. These look-up tables are used in the geocoding process to match records for which no exact match can be found. See Chapter 10 for more details about geocoding.
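
To illustrate how the longitude and latitude pairs of a geographic location look-up table could yield a distance between two postcodes, the following minimal sketch uses the haversine great-circle formula. This is an assumption made for illustration and not necessarily the calculation Febrl performs; the function name great_circle_km, the dictionary postcode_locations and its keys and coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an approximation)


def great_circle_km(loc_a, loc_b):
    """Haversine distance in km between two (longitude, latitude) pairs
    given in degrees."""
    lon_a, lat_a = (math.radians(v) for v in loc_a)
    lon_b, lat_b = (math.radians(v) for v in loc_b)
    d_lon = lon_b - lon_a
    d_lat = lat_b - lat_a
    h = (math.sin(d_lat / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(d_lon / 2) ** 2)
    # Guard against rounding errors pushing h marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


# Hypothetical table contents: keys and coordinates are made up.
postcode_locations = {
    "A": (0.0, 0.0),  # (longitude, latitude)
    "B": (1.0, 0.0),
}

distance = great_circle_km(postcode_locations["A"], postcode_locations["B"])
```

A field comparison function could then map such a distance to a matching weight, for example by giving closer postcodes a higher weight.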

For each of these five different tables or lists the corresponding file format is described in the following sections. Common to all file formats is that lines starting with the hash character '#' are comment lines, and their content is skipped. Note that if a '#' character is not at the beginning of a line it is not interpreted as the start of a comment. Instead it will be used as a normal part of a look-up table or correction list entry.
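
The comment rule above can be sketched as follows. This is a simplified illustration and not the code in lookup.py; the function name data_lines is hypothetical, and skipping blank lines is an additional assumption that is not stated in this chapter.

```python
def data_lines(lines):
    """Yield the lines of a look-up file that are not comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # comment line: '#' is the very first character
        if not line.strip():
            continue  # assumption: blank lines carry no entry
        yield line  # a '#' further along the line stays part of the entry


sample = [
    "# this whole line is a comment\n",
    "entry one\n",
    "entry # two\n",
    "\n",
]

kept = list(data_lines(sample))
```

Here kept holds the two entry lines, and the second one still contains its '#' character.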

For the tagging, frequency, geographic location and neighbouring region look-up tables, it is possible to load more than one file into one combined look-up table (see the examples given in the following sections).

The tagging, frequency, geographic location and neighbouring region look-up tables also accept an additional argument, default, when a table is initialised. This is a Python object (e.g. a string or a list) that is returned when the table is queried for a value that is not stored in it.
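
The combination of several sources into one table and the behaviour of the default argument can be pictured with a small dictionary-based sketch. It does not reproduce the classes in lookup.py: the class name DefaultingTable, its load method and the sample entries are hypothetical, it merges in-memory dictionaries rather than reading files, and letting later sources overwrite earlier ones on duplicate keys is an assumption.

```python
class DefaultingTable(dict):
    """Hypothetical stand-in for a look-up table with a default value."""

    def __init__(self, default=None):
        super().__init__()
        self.default = default

    def __missing__(self, key):
        # Called by table[key] when key is not stored; nothing is inserted.
        return self.default

    def load(self, sources):
        """Merge one mapping or a list of mappings into this table."""
        if isinstance(sources, dict):
            sources = [sources]
        for source in sources:
            self.update(source)  # assumption: later sources overwrite


# Made-up entries standing in for the contents of two table files.
part_one = {"word_a": 3}
part_two = {"word_b": 7}

table = DefaultingTable(default=0)
table.load([part_one, part_two])

known = table["word_a"]
unknown = table["word_c"]
```

In this sketch unknown takes the default value 0, and the missing key is not added to the table by the query.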

Footnotes

^14.1 AutoMatch, as formerly sold by MatchWare Technologies, derived quite small frequency weight tables directly from the input files. The Febrl system allows much larger frequency weight tables, derived from external sources such as a telephone directory, to be used, hence the need to specify the format of the frequency table files.