9.1 Indexing

The aim of indexing is to reduce the potentially huge number of comparisons (every record in one data set with all records in another data set) by eliminating comparisons between records that obviously are not matches. In other words, indexing reduces the large search space by forming groups of records that are very likely to be matches. Indexing can also be seen as a clustering method that brings together records that are similar, so only these records need to be compared using the more expensive (i.e. compute intensive) field comparisons functions as explained in Section 9.2 below.

Currently the Febrl system contains several indexing methods, including the traditional blocking method used in many record linkage systems. These indexing methods are implemented in the module indexing.py. Indexes are normally built while a data set is being standardised. After an index is built a compacting has to be done which builds index data structures that can return the blocks more efficiently.

Note: For a linkage process the index methods used must be the same for both data sets (i.e. it is not possible to use a blocking index for one and data set a sorted neighbourhood index for the other).

All indexing methods have the following attributes which can (some need) to be given as arguments when an indexing method is initialised.

In the following example, three indexes are defined and a traditional blocking index is initialised. The first index is based on the Soundex encoding of the reversed surname field values, with a maximal code length of three, the second index is based on a combination of the first two characters in the given name field (the truncate method is used for this) concatenated with the values in the postcode field (the method is direct which means the postcode values are not encoded or modified in any way), and the third index is a based on concatenation of the first two digits in the postcode plus the NYSIIS encoding of the surname values.

# ====================================================================

hosp_block_def = [[('surname','soundex', 3, 'reverse')],
                  [('givenname','truncate',2), ('postcode','direct')],
                  [('postcode','truncate',2), ('surname','nysiis')],
                 ]

hospital_index = BlockingIndex(name = 'HospIndex',
                            dataset = tmpdata,
                   block_definition = hosp_block_def)

As can be seen from the example, an indexing definition is made of a Python list (square brackets), with one or more index definitions inside, each again being a Python list (one per line in the above example). An index definition itself is made of Python tuples of the form (field_name, method, parameters). Field names must be valid field names from the data set the index is applied on (see argument dataset in the index definition). The method can be set to one of the following methods (with corresponding parameters).

Note that all encoding methods are implemented in the module encode.py.

In the following sections the available indexing methods are described in more details and their special arguments are listed.



Subsections