This index implements a data structure based on bigrams and allows
for fuzzy blocking. The basic idea is that after an index has
been built, the values of the blocking variables will be converted
into lists of bigrams, and permutations of sub-lists will be built
from each list. A given threshold (a number between 0.0 and 1.0)
determines the length of these sub-lists as a fraction of the length
of the full bigram list. The resulting bigram sub-lists will be inserted
into an inverted index, i.e. the record numbers in the blocks will be
inserted into Python dictionaries under a key for each sub-list. Such
an inverted index will then be used to retrieve the blocks.
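The inverted index idea can be sketched with a plain Python dictionary. This is a minimal illustration only, not the index's actual implementation: the function names `insert_record` and `candidate_records`, the record numbers, and the key strings are all hypothetical.

```python
from collections import defaultdict

# Inverted index: maps a sub-list key (a string) to the record numbers
# that produced this key.
inverted_index = defaultdict(list)

def insert_record(index, record_number, sublist_keys):
    """Insert a record number under each of its sub-list keys (hypothetical helper)."""
    for key in sublist_keys:
        index[key].append(record_number)

def candidate_records(index, sublist_keys):
    """Collect all record numbers stored under any of the given keys."""
    found = set()
    for key in sublist_keys:
        found.update(index.get(key, []))
    return found

# Hypothetical keys: records 0 and 1 share the key 'teer', record 2 shares none.
insert_record(inverted_index, 0, ['peet', 'pete', 'teer'])
insert_record(inverted_index, 1, ['paat', 'pate', 'teer'])
insert_record(inverted_index, 2, ['jooh', 'john', 'ohhn'])

print(candidate_records(inverted_index, ['peet', 'pete', 'teer']))
```

Records that share at least one key end up in a common block, which is what makes the blocking fuzzy: two values do not have to be identical to be compared.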
When a bigram index is initialised, one argument (besides the base class arguments as presented in Section 9.1 above) that needs to be given is:
threshold
For example, assume a block definition contains the tuple
block_definition = [[('givname','direct')], ...]
and a bigram threshold has been set. If the value 'peter' is given in the
'givname' (given name) field, the corresponding bigram list will be
['pe','et','te','er'], with four elements. Assume the threshold is such
that applying it to this length and rounding gives 3 (a threshold of 0.75,
for instance, gives 4 × 0.75 = 3). This means all permutations of length 3
are calculated. For the given example they are
['pe','et','te']
['pe','et','er']
['pe','te','er']
['et','te','er']
So, the corresponding record number will be inserted into the inverted
index blocks with keys 'peette', 'peeter', 'peteer', and 'etteer'.
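The 'peter' example can be reproduced with a short sketch. It rests on assumptions that are not confirmed by the text: the threshold is taken to be 0.75 (the value used in the initialisation example in this section), the sub-list length is taken to be the bigram count multiplied by the threshold and rounded to the nearest integer, and the sub-lists are generated as order-preserving combinations, which is what the four lists shown above correspond to. The function names are hypothetical.

```python
from itertools import combinations

def bigrams(value):
    """Return the list of overlapping two-character substrings of value."""
    return [value[i:i + 2] for i in range(len(value) - 1)]

def bigram_sublist_keys(value, threshold):
    """Return the inverted-index keys for all bigram sub-lists of value."""
    grams = bigrams(value)
    # Assumption: sub-list length is the bigram count times the threshold,
    # rounded to the nearest integer and never smaller than 1.
    length = max(1, int(round(len(grams) * threshold)))
    # Each key is the concatenation of the bigrams in one sub-list.
    return [''.join(sub) for sub in combinations(grams, length)]

print(bigrams('peter'))
print(bigram_sublist_keys('peter', 0.75))
```

Under these assumptions the second call yields the four keys 'peette', 'peeter', 'peteer' and 'etteer' listed above.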
The lower the threshold, the shorter the sub-lists, but the more sub-lists there will be per field value, resulting in more (and smaller) blocks in the inverted index.
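The effect of the threshold on the number of sub-lists can be checked with a small sketch. As an assumption, the sub-list length is computed as the bigram count times the threshold, rounded to the nearest integer, and sub-lists are counted as order-preserving combinations; `count_sublists` is a hypothetical helper, not part of the index itself.

```python
from itertools import combinations

def count_sublists(value, threshold):
    """Return (sub-list length, number of sub-lists) for value at this threshold."""
    grams = [value[i:i + 2] for i in range(len(value) - 1)]
    # Assumption: length = round(number of bigrams * threshold), at least 1.
    length = max(1, int(round(len(grams) * threshold)))
    count = sum(1 for _ in combinations(grams, length))
    return length, count

for threshold in (1.0, 0.75, 0.5):
    length, count = count_sublists('peter', threshold)
    print(threshold, length, count)
```

For 'peter' this gives 1 sub-list of length 4, 4 sub-lists of length 3, and 6 sub-lists of length 2. Under this sketch the count is a binomial coefficient, so it stops growing once the sub-list length drops below half the length of the bigram list.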
The following example shows how to define and initialise a bigram index.
# ====================================================================

hosp_block_def = [[('surname','soundex', 3, 'reverse')],
                  [('givenname','truncate',2), ('postcode','direct')],
                  [('postcode','truncate',2), ('surname','nysiis')],
                 ]

hospital_index = BigramIndex(name = 'HospIndex',
                             dataset = tmpdata,
                             block_definition = hosp_block_def,
                             threshold = 0.75)