14. Look-up and Frequency Tables
In the data cleaning and standardisation process,
correction lists and look-up tables with word corrections and
expansions are needed, and in the linkage process frequency tables can
be used to compute matching weight probabilities for various
components of names and addresses [14.1]. For geocoding, neighbouring
region look-up tables are needed.
These lists and tables are stored in text files and can be created and
edited by the user. Five different types of look-up tables and
corresponding file formats as used in the Febrl system are
described in this chapter. Their access routines are implemented in
the module lookup.py.
- Correction list
These lists contain strings (characters or words) and their
replacements. They are used in the initial cleaning step in the
data standardisation process.
- Tagging look-up table
Similar to correction lists, these tables contain strings and
their replacements. Additionally, groups of table entries are
assigned a tag, which is then used in the tagging step within
the data standardisation process. Tagging look-up tables are
mainly used for names and addresses.
- Frequency look-up table
These tables contain words and integer numbers that correspond
to the frequency of the word as listed in an external frequency
file. Such frequency tables are used within the record linkage
process to calculate frequency dependent matching weights.
- Geographic location look-up table
These tables contain words and their corresponding geographical
location given as a numerical longitude and latitude pair. For
example, postcode locations can be used to calculate the
distance between two postcodes, which can then be used in a
field comparison function to calculate a matching weight.
- Neighbouring region look-up table
These tables contain region values, for example postcodes or
suburb names, and for each of them a list of their neighbouring
regions. These look-up tables are used in the geocoding process
for matching of records for which no exact match can be found.
See Chapter 10 for more details about geocoding.
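To illustrate how a geographic location look-up table could feed a distance calculation between two postcodes, here is a minimal sketch. It is not the Febrl implementation: the function names, the dictionary layout and the use of the haversine great-circle formula are all assumptions made for this example, and the coordinates are rough illustrative values rather than entries from a Febrl look-up file.

```python
import math

def great_circle_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Haversine distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical table: postcode -> (longitude, latitude), illustrative values only
locations = {
    "2600": (149.13, -35.31),
    "2000": (151.21, -33.87),
}

def postcode_distance_km(pc1, pc2, table=locations):
    """Return the distance between two postcodes, or None if either is unknown."""
    if pc1 not in table or pc2 not in table:
        return None
    lon1, lat1 = table[pc1]
    lon2, lat2 = table[pc2]
    return great_circle_km(lon1, lat1, lon2, lat2)
```

A field comparison function could then turn such a distance into a matching weight, for example by giving full agreement at distance zero and decreasing the weight as the distance grows.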
For each of these five different tables or lists the corresponding
file format is described in the following sections. Common to all file
formats is that lines starting with the hash character ('#') are
comment lines, and their content is skipped. Note that if a
'#' character is not at the beginning of a line it is not
interpreted as the start of a comment. Instead it will be used as a
normal part of a look-up table or correction list entry.
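The comment rule described above can be sketched as follows. This is only an illustration of the rule, not the code in lookup.py: the function name read_entries is hypothetical, and skipping blank lines is an assumption that the text above does not state.

```python
def read_entries(lines):
    """Return the non-comment lines of a look-up file as stripped strings."""
    entries = []
    for line in lines:
        # Only a '#' in the very first column marks a comment line
        if line.startswith("#"):
            continue
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines carry no entry
        # A '#' elsewhere in the line is kept as part of the entry
        entries.append(stripped)
    return entries

sample = [
    "# Street type corrections\n",
    "apt # 5 := apartment 5\n",
    "\n",
    "street\n",
]
entries = read_entries(sample)
```

Here the first line is skipped as a comment, while the '#' inside the second line survives as part of the entry. The ':=' separator in the sample is made up for the example and does not describe any of the actual file formats.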
For the tagging, frequency, geographic location and neighbouring
region look-up tables, it is possible to load more than one file into
one combined look-up table (see the examples given in the following
sections).
Also for the tagging, frequency, geographic location and neighbouring
region look-up tables, an additional argument, default, can be given
when a table is initialised. It is a Python object (e.g. a string or
list) that is returned when the table is queried for a value that is
not stored in the table.
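The behaviour of a combined table with a default value can be sketched as below. The class SimpleLookup and its methods are hypothetical stand-ins written for this example; they are not the classes or signatures of lookup.py, and the frequency counts are made up. How Febrl treats a key that appears in more than one loaded file is not stated here, so this sketch simply lets later entries overwrite earlier ones.

```python
class SimpleLookup:
    """Hypothetical look-up table that returns a default for unknown keys."""

    def __init__(self, default=None):
        self.default = default   # returned when a key is not stored
        self._data = {}

    def load(self, mapping):
        """Merge another set of entries into the combined table."""
        # Assumption: later entries overwrite earlier ones with the same key
        self._data.update(mapping)

    def get(self, key):
        return self._data.get(key, self.default)

# Two separately loaded sources combined into one table (made-up counts)
freq = SimpleLookup(default=1)
freq.load({"smith": 1200, "jones": 950})
freq.load({"nguyen": 800})

known = freq.get("smith")
unknown = freq.get("zyxwv")   # not stored, so the default is returned
```

Choosing a default that fits the table type (for example a small count for a frequency table, or an empty list for a neighbouring region table) lets calling code avoid a separate check for missing keys.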
[14.1] AutoMatch, as
formerly sold by MatchWare Technologies, derived quite small
frequency weight tables directly from the input files. The
Febrl system allows much larger frequency weight tables
derived from external sources (such as a telephone directory) to be
used, hence the need to specify the format of the frequency table