9.1.3 Bigram Indexing

 This index implements a data structure based on bigrams and allows for fuzzy blocking. The basic idea is that after an index has been built, the values of the blocking variables will be converted into a list of bigrams, and permutations of sub-lists will be built using a given threshold (a number between $0.0$ and $1.0$) of all possible permutations. The resulting bigram lists will be inserted into an inverted index, i.e. record numbers in the blocks will be inserted into Python dictionaries for each bigram. Such an inverted index will then be used to retrieve the blocks.

When a bigram index is initialised, one argument (besides the base class arguments as presented in Section 9.1 above) that needs to be given is:

For example, assume a block definition contains the tuple

block_definition = [[('givname','direct')], ...]

and the bigram threshold is set to $0.8$. If a value 'peter' is given in the 'givname' (given name) field, the corresponding bigram list will be ['pe','et','te','er'] with four elements, so using the $0.8$ threshold results in $4*0.8 = 3.2$ rounded to $3$, which means all permutations of length 3 are calculated. For the given example they are

['pe','et','te']
['pe','et','er']
['pe','te','er']
['et','te','er']

So, the corresponding record number will be inserted into the inverted index blocks with keys 'peette', 'peeter', 'peteer, and 'etteer'.

The lower the threshold, the shorter the sub-lists, but the more sub-lists there will be per field value, resulting in more (smaller blocks) in the inverted index.

The following example shows how to define and initialise a bigram index.

# ====================================================================

hosp_block_def = [[('surname','soundex', 3, 'reverse')],
                  [('givenname','truncate',2), ('postcode','direct')],
                  [('postcode','truncate',2), ('surname','nysiis')],
                 ]

hospital_index = BigramIndex(name = 'HospIndex',
                          dataset = tmpdata,
                 block_definition = hosp_block_def,
                        threshold = 0.75)