14. Look-up and Frequency Tables

In the data cleaning and standardisation process, correction lists and look-up tables with word corrections and expansions are needed. In the linkage process, frequency tables can be used to compute matching weight probabilities for various components of names and addresses [14.1]. For geocoding, neighbouring region look-up tables are needed. These lists and tables are stored in text files and can be created and edited by the user. This chapter describes the five types of look-up tables used in the Febrl system, together with their file formats. Their access routines are implemented in the module lookup.py.

The file format for each of these five tables or lists is described in the following sections. Common to all file formats is that lines starting with the hash character '#' are comment lines, and their content is skipped. A '#' character that is not at the beginning of a line is not interpreted as the start of a comment; it is treated as a normal part of a look-up table or correction list entry.
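The comment rule can be illustrated with a minimal sketch. It is not the code from lookup.py: the function name read_table_lines and the sample entries are hypothetical, and the sketch assumes that "starting with" means the '#' is the very first character of the line.

```python
def read_table_lines(text):
    """Return the lines of a table file that are not comment lines.

    Hypothetical helper for illustration only, not part of Febrl.
    Assumption: a comment line has '#' as its very first character.
    """
    kept = []
    for line in text.splitlines():
        if line.startswith('#'):
            continue  # Comment line: its content is skipped
        kept.append(line)  # A '#' later in the line stays part of the entry
    return kept


# Made-up sample content; the real entry formats are described
# in the following sections.
sample = "# a comment line\nmister : mr\nflat #3 : unit 3\n"
lines = read_table_lines(sample)
```

In this sketch the third sample line is kept unchanged, because its '#' is not at the beginning of the line.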

For the tagging, frequency, geographic location and neighbouring region look-up tables, it is possible to load more than one file into one combined look-up table (see the examples given in the following sections).

For the same four table types (tagging, frequency, geographic location and neighbouring region), an additional argument, default, can be given when a table is initialised. It is a Python object (e.g. a string or list) that is returned when the table is queried for a value that is not stored in the table.
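The behaviour of a combined table with a default value can be sketched as follows. This is an illustration only and does not reproduce the classes in lookup.py: the class SimpleLookup, its methods load and get, and the sample data are all hypothetical. The sketch also assumes that a later source overrides an earlier one when both contain the same key, which the text above does not specify.

```python
class SimpleLookup:
    """Hypothetical combined look-up table with a default value.

    Illustration only; not the Febrl lookup.py API.
    """

    def __init__(self, default=None):
        self.default = default  # Returned for keys that are not stored
        self.entries = {}

    def load(self, pairs):
        """Merge one source (here a dict standing in for a file) into the table.

        Assumption: on a key clash the most recently loaded source wins.
        """
        self.entries.update(pairs)

    def get(self, key):
        """Return the stored value, or the default if the key is unknown."""
        return self.entries.get(key, self.default)


# Two made-up sources combined into one table
table = SimpleLookup(default='unknown')
table.load({'smith': 120, 'jones': 95})
table.load({'miller': 40})

found = table.get('miller')    # Stored in the second source
missing = table.get('nguyen')  # Not stored, so the default is returned
```

Keeping the default on the table object means that code querying the table does not need a separate check for missing keys.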



Footnotes

[14.1]
 The AutoMatch software, formerly sold by MatchWare Technologies, derived quite small frequency weight tables directly from the input files. The Febrl system allows much larger frequency weight tables, derived from external sources such as a telephone directory, to be used, hence the need to specify the format of the frequency table files.


