The aim of indexing is to reduce the potentially huge number of comparisons (every record in one data set with all records in another data set) by eliminating comparisons between records that obviously are not matches. In other words, indexing reduces the large search space by forming groups of records that are very likely to be matches. Indexing can also be seen as a clustering method that brings together records that are similar, so only these records need to be compared using the more expensive (i.e. compute intensive) field comparisons functions as explained in Section 9.2 below.
Currently the Febrl system contains several indexing methods, including the traditional blocking method used in many record linkage systems. These indexing methods are implemented in the module indexing.py. Indexes are normally built while a data set is being standardised. After an index is built a compacting has to be done which builds index data structures that can return the blocks more efficiently.
All indexing methods have the following attributes which can (some need) to be given as arguments when an indexing method is initialised.
name
description
dataset
block_definition
(field_name, method, parameters)
. The given
field names must be available in the defined data set. Methods
and parameters are explained in more details in the following
subsections.
skip_missing
True
records which have empty indexing
values will be skipped over, if set to False
a block with
an empty indexing field value will be included in the index as
well. Note that such a block can become quite large if there are
many missing values in a data set, extending the run time of a
linkage or deduplication process significantly. The default
value is True
.
In the following example, three indexes are defined and a traditional
blocking index is initialised. The first index is based on the
Soundex encoding of the reversed surname field
values, with a maximal code length of three, the second index is based
on a combination of the first two characters in the given name field
(the truncate
method is used for this) concatenated with the
values in the postcode field (the method is direct
which means
the postcode values are not encoded or modified in any way), and the
third index is a based on concatenation of the first two digits in the
postcode plus the NYSIIS encoding of the surname
values.
# ==================================================================== hosp_block_def = [[('surname','soundex', 3, 'reverse')], [('givenname','truncate',2), ('postcode','direct')], [('postcode','truncate',2), ('surname','nysiis')], ] hospital_index = BlockingIndex(name = 'HospIndex', dataset = tmpdata, block_definition = hosp_block_def)
As can be seen from the example, an indexing definition is made of a
Python list (square brackets), with one or more index definitions
inside, each again being a Python list (one per line in the above
example). An index definition itself is made of Python tuples of the
form (field_name, method, parameters)
. Field names must be
valid field names from the data set the index is applied on (see
argument dataset
in the index definition). The method
can be set to one of the following methods (with corresponding
parameters).
direct
soundex
'reverse'
. When given, the values in the
field are reversed before they are encoded. The default value
for the maximal length is 4.
mod_soundex
'reverse'
as with Soundex can be
given.
nysiis
'reverse'
as with Soundex
can be given.
phonex
'reverse'
as with Soundex can be
given.
dmetaphone
'reverse'
as with Soundex can be given.
truncate
In the following sections the available indexing methods are described in more details and their special arguments are listed.