13. Look-up and Frequency Tables
In the data cleaning and standardisation process,
correction lists and look-up tables with word corrections and
expansions are needed, and in the linkage process frequency tables can
be used to compute matching weight probabilities for various
components of names and addresses [13.1]. These lists and tables are stored in text files and can be
created and edited by the user. Four different types of look-up tables
and corresponding file formats as used in the Febrl system
are described in this chapter. Their access routines are implemented
in the module lookup.py.
- Correction list
These lists contain strings (characters or words) and their
replacements. They are used in the initial cleaning step in the
data standardisation process.
- Tagging look-up table
Similar to correction lists, these tables contain strings and
their replacements. Additionally, groups of table entries are
assigned a tag, which is then used in the tagging step within
the data standardisation process. Tagging look-up tables are
mainly used for names and addresses.
- Frequency look-up table
These tables contain words and integer numbers that correspond
to the frequency of the word as listed in an external frequency
file. Such frequency tables are used within the record linkage
process to calculate frequency dependent matching weights.
- Geographic location look-up table
These tables contain words and their corresponding geographical
location given as a numerical longitude and latitude pair. For
example, postcode locations can be used to calculate the
distance between two postcodes, which can then be used in a
field comparison function to calculate a matching weight.
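The postcode distance mentioned for geographic location tables can be illustrated with a great-circle calculation. The following is a minimal sketch, not the Febrl implementation: the function name great_circle_km, the haversine formula, and the Earth radius of 6371 km are all assumptions made for illustration. Locations are given as (longitude, latitude) pairs in degrees, following the order used in the description above.

```python
import math


def great_circle_km(loc_a, loc_b, radius_km=6371.0):
    """Approximate distance in kilometres between two locations.

    Each location is a (longitude, latitude) pair in degrees.
    Hypothetical helper using the haversine formula; not part of Febrl.
    """
    lon1, lat1 = (math.radians(v) for v in loc_a)
    lon2, lat2 = (math.radians(v) for v in loc_b)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    h = (math.sin(dlat / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2.0) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot
    return 2.0 * radius_km * math.asin(min(1.0, math.sqrt(h)))
```

A field comparison function could then map this distance to a matching weight, for example by giving full agreement weight at zero distance and less as the distance grows. How Febrl does this mapping is not specified in this chapter.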
For each of these four different tables or lists the corresponding
file format is described in the following sections. Common to all file
formats is that lines starting with the hash character '#' are
comment lines, and their content is skipped. Note that if a '#'
character is not at the beginning of a line, it is not
interpreted as the start of a comment. Instead it will be used as a
normal part of a look-up table or correction list entry.
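The comment rule above can be sketched as a small line filter. This is an illustrative sketch only, not the code in lookup.py: the function name read_table_lines is hypothetical, and skipping blank lines is an assumption that the text above does not state.

```python
def read_table_lines(lines):
    """Yield the entry lines from an iterable of text lines.

    Hypothetical helper, not taken from lookup.py.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # comment line: '#' is the first character
        if not line.strip():
            continue  # assumption: blank lines carry no entry
        yield line  # a '#' later in the line stays part of the entry


sample = [
    "# this whole line is a comment\n",
    "entry one # still part of the entry\n",
    "\n",
    "entry two\n",
]
entries = list(read_table_lines(sample))
```

Here entries holds the two entry lines, and the first of them keeps its embedded '#' character unchanged.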
For the tagging, frequency and geographic location look-up tables, it
is possible to load more than one file into one combined look-up
table (see the examples given in the following sections).
For the tagging, frequency and geographic location look-up tables,
an additional argument, default, can also be given when a table is
initialised. It is a string that is returned when the table is queried
for a value that is not stored in the table.
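The combination of several files into one table and the behaviour of the default argument can be illustrated with a dictionary-based sketch. The class SimpleLookupTable and its methods load and get are hypothetical and do not reproduce the interface of lookup.py. In particular, letting a later entry overwrite an earlier one with the same key is an assumption, and the entries are passed as dictionaries rather than read from files.

```python
class SimpleLookupTable:
    """Hypothetical stand-in for a look-up table; not the Febrl class."""

    def __init__(self, default=None):
        self.default = default  # returned for keys not in the table
        self._entries = {}

    def load(self, entries):
        """Merge a mapping of entries into the combined table.

        Assumption: on duplicate keys, the later entry wins.
        """
        self._entries.update(entries)

    def get(self, key):
        """Return the stored value, or the default if the key is missing."""
        return self._entries.get(key, self.default)


table = SimpleLookupTable(default="n/a")
table.load({"smith": 120})  # entries from a first (made-up) source
table.load({"jones": 95})   # entries from a second (made-up) source
known = table.get("jones")
missing = table.get("miller")
```

After both loads the table answers queries for entries from either source, and a query for an unknown word returns the default string instead of raising an error.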
Footnotes
- [13.1] AutoMatch, as formerly sold by MatchWare Technologies,
derived quite small frequency weight tables directly from the input
files. The Febrl system allows much larger frequency weight tables
derived from external sources (such as a telephone directory) to be
used, hence the need to specify the format of the frequency table
files.