The sorting index extends the idea of the classical blocking index in that the values of the blocks (e.g. the Soundex encodings of surnames) are sorted alphabetically and then a sliding window is moved over these sorted blocks. When a sorting index is initialised, one argument (besides the base class arguments as presented in Section 9.1 above) that needs to be given is:
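window_size  The size of the sliding window, given as a number of consecutive blocks. For example, assume a blocking index contains the following blocks (e.g. Soundex encodings) with their corresponding record numbers: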
a123: [4,12,89,99]
a129: [6,32,54,84,91]
a245: [1,39]
a689: [3,17,21,35,49,76,87,93]
a911: [2,42,66]
b111: [8]
With the blocking index, the records within one block are compared only with the records in the corresponding block of the second data set. With the sorting index and a window size of 3, larger blocks are formed by combining three consecutive blocks:
[a123,a129,a245]: [1,4,6,12,32,39,54,84,89,91,99]
[a129,a245,a689]: [1,3,6,17,21,32,35,39,49,54,76,84,87,91,93]
[a245,a689,a911]: [1,2,3,17,21,35,39,42,49,66,76,87,93]
[a689,a911,b111]: [2,3,8,17,21,35,42,49,66,76,87,93]
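The window construction above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not the actual SortingIndex implementation: the function name sliding_window_blocks and the plain dictionary of blocks are assumptions made for illustration, as is the choice to form a single window when there are fewer blocks than the window size.

```python
def sliding_window_blocks(blocks, window_size):
    """Merge each run of `window_size` consecutive blocks, in sorted key order.

    `blocks` maps a block value (e.g. a Soundex code) to a list of record
    numbers. Hypothetical helper for illustration only.
    """
    if window_size < 1:
        raise ValueError("window_size must be a positive integer")
    keys = sorted(blocks)  # sort the block values alphabetically
    if not keys:
        return []
    # Assumption: with fewer blocks than window_size, use one window over all.
    num_windows = max(len(keys) - window_size + 1, 1)
    merged = []
    for start in range(num_windows):
        window_keys = keys[start:start + window_size]
        # Union of the record numbers of all blocks inside the window
        records = sorted(rec for key in window_keys for rec in blocks[key])
        merged.append((window_keys, records))
    return merged


blocks = {
    'a123': [4, 12, 89, 99],
    'a129': [6, 32, 54, 84, 91],
    'a245': [1, 39],
    'a689': [3, 17, 21, 35, 49, 76, 87, 93],
    'a911': [2, 42, 66],
    'b111': [8],
}

for window_keys, records in sliding_window_blocks(blocks, 3):
    print(window_keys, records)
```

With six blocks and a window size of 3 the sketch produces four merged blocks, one for each position of the window.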
The idea behind this is that neighbouring blocks might contain records with similar values in the blocking variables, because errors in the original values can place records that belong together into different but adjacent blocks. The example below shows how to define and initialise a sorting index.
# ====================================================================

hosp_block_def = [[('surname','soundex', 3, 'reverse')],
                  [('givenname','truncate',2), ('postcode','direct')],
                  [('postcode','truncate',2), ('surname','nysiis')],
                 ]

hospital_index = SortingIndex(name = 'HospIndex',
                              dataset = tmpdata,
                              block_definition = hosp_block_def,
                              window_size = 5)
Note that if the window size is set to 1 then the sorting index becomes equivalent to the blocking index, because each window then contains exactly one block.