This index implements a data structure based on bigrams and allows for
fuzzy blocking. The basic idea is that after an index has been
built, the values of the blocking variables will be converted into a
list of bigrams, which is sorted alphabetically (and duplicate bigrams
are removed), and sub-lists will be built using a user provided
threshold (a number between and
) of all possible
combinations. These resulting bigram sub-lists will be inserted into
an inverted index, i.e. record numbers in the blocks will be inserted
into Python dictionaries for each bigram sub-list. Such an inverted
index will then be used to retrieve the blocks.
When a bigram index is initialised, one argument (besides the base class arguments as presented in Section 9.1 above) that needs to be given is:
threshold
For example, assume a block definition contains the tuple
block_definition = [[('givname','direct')], ...]
and the bigram threshold is set to . If a value
'peter'
is
given in the 'givname'
(given name) field, the corresponding
bigram list will be ['pe','et','te','er']
with four elements.
It will be sorted to ['er','et','pe','te']
, and using the
threshold results in
rounded to
, which means all
combinations of length 3 are calculated. For the given example they
are
['et','pe','te']
['er','pe','te']
['er','et','te']
['er','et','pe']
So, the corresponding record number will be inserted into the inverted
index blocks with keys 'etpete'
, 'erpete'
,
'erette'
, and 'eretpe'
.
The lower the threshold, the shorter the sub-lists, but the more sub-lists there will be per field value, resulting in more (smaller blocks) in the inverted index.
The following example shows how to define and initialise a bigram index.
# ==================================================================== hosp_block_def = [[('surname','soundex', 3, 'reverse')], [('givenname','truncate',2), ('postcode','direct')], [('postcode','truncate',2), ('surname','nysiis')], ] hospital_index = BigramIndex(name = 'HospIndex', dataset = tmpdata, block_definition = hosp_block_def, threshold = 0.75)