9.1.3 Bigram Indexing

This index implements a data structure based on bigrams and allows for fuzzy blocking. The basic idea is that after an index has been built, the values of the blocking variables will be converted into a list of bigrams, which is sorted alphabetically (and duplicate bigrams are removed), and sub-lists will be built using a user provided threshold (a number between $0.0$ and $1.0$) of all possible combinations. These resulting bigram sub-lists will be inserted into an inverted index, i.e. record numbers in the blocks will be inserted into Python dictionaries for each bigram sub-list. Such an inverted index will then be used to retrieve the blocks.

When a bigram index is initialised, one argument (besides the base class arguments as presented in Section 9.1 above) that needs to be given is:

For example, assume a block definition contains the tuple

block_definition = [[('givname','direct')], ...]

and the bigram threshold is set to $0.8$. If a value 'peter' is given in the 'givname' (given name) field, the corresponding bigram list will be ['pe','et','te','er'] with four elements. It will be sorted to ['er','et','pe','te'], and using the $0.8$ threshold results in $4*0.8 = 3.2$ rounded to $3$, which means all combinations of length 3 are calculated. For the given example they are


So, the corresponding record number will be inserted into the inverted index blocks with keys 'etpete', 'erpete', 'erette', and 'eretpe'.

The lower the threshold, the shorter the sub-lists, but the more sub-lists there will be per field value, resulting in more (smaller blocks) in the inverted index.

The following example shows how to define and initialise a bigram index.

# ====================================================================

hosp_block_def = [[('surname','soundex', 3, 'reverse')],
                  [('givenname','truncate',2), ('postcode','direct')],
                  [('postcode','truncate',2), ('surname','nysiis')],

hospital_index = BigramIndex(name = 'HospIndex',
                          dataset = tmpdata,
                 block_definition = hosp_block_def,
                        threshold = 0.75)