9.1 Indexing

The aim of indexing is to reduce the potentially huge number of comparisons (every record in one data set with all records in another data set) by eliminating comparisons between records that obviously are not matches. In other words, indexing reduces the large search space by forming groups of records that are very likely to be matches. Indexing can also be seen as a clustering method that brings together records that are similar, so only these records need to be compared using the more expensive (i.e. compute intensive) field comparisons functions as explained in Section 9.2 below.

Currently the Febrl system contains several indexing methods, including the traditional blocking method used in many record linkage systems. These indexing methods are implemented in the module indexing.py. Indexes are normally built while a data set is being standardised. After an index is built a compacting has to be done which builds index data structures that can return the blocks more efficiently.

Note: For a linkage process the index methods used must be the same for both data sets (i.e. it is not possible to use a blocking index for one and data set a sorted neighbourhood index for the other).

All indexing methods have the following attributes which can (some need) to be given as arguments when an indexing method is initialised.

name
A name for an indexing method. This should be a short string.
description
A longer description for an indexing method. This attribute is not necessary, but can be useful for documenting the indexing method.
dataset
A reference to the data set the index is built for (i.e. when the index is built it is assumed that records are having the fields as defined in this data set).
block_definition
The definition of how the index and its blocks should be built. Blocks are defined using a list of lists each containing tuples of the form (field_name, method, parameters). The given field names must be available in the defined data set. Methods and parameters are explained in more details in the following subsections.
skip_missing
A flag, if set to True records which have empty indexing values will be skipped over, if set to False a block with an empty indexing field value will be included in the index as well. Note that such a block can become quite large if there are many missing values in a data set, extending the run time of a linkage or deduplication process significantly. The default value is True.

In the following example, three indexes are defined and a traditional blocking index is initialised. The first index is based on the Soundex encoding of the reversed surname field values, with a maximal code length of three, the second index is based on a combination of the first two characters in the given name field (the truncate method is used for this) concatenated with the values in the postcode field (the method is direct which means the postcode values are not encoded or modified in any way), and the third index is a based on concatenation of the first two digits in the postcode plus the NYSIIS encoding of the surname values.

# ====================================================================

hosp_block_def = [[('surname','soundex', 3, 'reverse')],
                  [('givenname','truncate',2), ('postcode','direct')],
                  [('postcode','truncate',2), ('surname','nysiis')],
                 ]

hospital_index = BlockingIndex(name = 'HospIndex',
                            dataset = tmpdata,
                   block_definition = hosp_block_def)

As can be seen from the example, an indexing definition is made of a Python list (square brackets), with one or more index definitions inside, each again being a Python list (one per line in the above example). An index definition itself is made of Python tuples of the form (field_name, method, parameters). Field names must be valid field names from the data set the index is applied on (see argument dataset in the index definition). The method can be set to one of the following methods (with corresponding parameters).

direct
The field values are directly taken as (part of) the blocking variables. No additional parameter is needed.
soundex
The Soundex phonetic encoding of the values in the given field are taken. A possible first parameter is the maximal length of the encoding (in characters), and a second parameter can be the word 'reverse'. When given, the values in the field are reversed before they are encoded. The default value for the maximal length is 4.
mod_soundex
A modified version of the Soundex algorithm. The same parameters maximal length and 'reverse' as with Soundex can be given.
nysiis
The NYSIIS phonetic encoding algorithm. The same parameters maximal length and 'reverse' as with Soundex can be given.
phonex
The Phonex [21] phonetic encoding algorithm which combines ideas from both Soundex and NYSIIS. The same parameters maximal length and 'reverse' as with Soundex can be given.
dmetaphone
The Double-metaphone [28] phonetic encoding algorithm. The same parameters maximal length and 'reverse' as with Soundex can be given.
truncate
A string truncation method, where as additional parameter the length must be given (i.e. strings longer than the given length are truncated).

Note that all encoding methods are implemented in the module encode.py.

In the following sections the available indexing methods are described in more details and their special arguments are listed.