Record linkage is the task of comparing pairs of records and deciding whether each pair is a match (the two records represent the same entity) or a non-match (they represent different entities). If the record linkage system cannot make this decision, human intervention (clerical review) is used to decide the matching status of the record pair. Assuming that cleaned and standardised records are available, linking records or deduplicating a data set consists of several steps.
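The three possible outcomes for a record pair can be illustrated with a minimal sketch. It assumes that each pair has already been reduced to a single numeric weight and that two thresholds separate the outcomes. The function name, the thresholds and the sample weights are hypothetical and chosen only for illustration; they are not taken from the classifiers described later in this chapter.

```python
def classify_pair(weight, lower=0.0, upper=10.0):
    """Assign one of three outcomes to a record pair (illustrative only).

    'weight' is assumed to be a summed comparison weight for the pair;
    'lower' and 'upper' are hypothetical thresholds.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight > upper:
        return "match"
    if weight < lower:
        return "non-match"
    # Weights between the two thresholds cannot be decided automatically
    # and are passed on to clerical review.
    return "possible-match"


# Hypothetical summed weights for three record pairs.
weights = {("a1", "b7"): 14.2, ("a2", "b3"): -3.5, ("a4", "b9"): 6.1}
decisions = {pair: classify_pair(w) for pair, w in weights.items()}
```

In this sketch only the pair with a weight between the two thresholds is left for clerical review; widening the gap between the thresholds sends more pairs to a human reviewer.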
The following sections describe in detail how a linkage or
deduplication process can be defined using a project object
(as shown in the example at the end of Chapter 5),
and how its necessary components (such as indexes, field comparison
functions and classifiers) can be defined. Indexing is the topic of
Section 9.1. All available field comparison functions
are described in Section 9.2,
and the initialisation of a record comparator is presented in
Section 9.3. Section 9.4 includes
example code that shows both field comparison functions and record
comparator initialisation. The definition of classifiers is then
discussed in Section 9.5, and the chapter
concludes with Section 9.6, which shows how to define and start
a linkage or deduplication process.
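To show how the components named above fit together, here is a small, self-contained sketch of a deduplication run: an index groups records into blocks, a field comparison function scores each field, and a record comparator collects the scores into a weight vector. All names, weights and sample records are hypothetical and serve only as an illustration; this is plain Python, not the project, index or comparator interface described in the following sections.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, already cleaned and standardised records.
records = {
    "r1": {"surname": "smith", "postcode": "2600"},
    "r2": {"surname": "smith", "postcode": "2600"},
    "r3": {"surname": "smyth", "postcode": "2600"},
    "r4": {"surname": "jones", "postcode": "3000"},
}


def build_index(recs, key_field):
    """Blocking index: group record identifiers by the value of one field."""
    blocks = defaultdict(list)
    for rec_id, rec in recs.items():
        blocks[rec[key_field]].append(rec_id)
    return blocks


def exact_compare(value_a, value_b, agree=5.0, disagree=-5.0):
    """Field comparison function: exact agreement or disagreement weight."""
    return agree if value_a == value_b else disagree


def compare_records(rec_a, rec_b, fields):
    """Record comparator: one weight per compared field."""
    return [exact_compare(rec_a[f], rec_b[f]) for f in fields]


index = build_index(records, "postcode")

# Only records within the same block are compared with each other.
pair_weights = {}
for block in index.values():
    for id_a, id_b in combinations(sorted(block), 2):
        vector = compare_records(records[id_a], records[id_b],
                                 ["surname", "postcode"])
        pair_weights[(id_a, id_b)] = sum(vector)
```

Because "r4" is alone in its block, it is never compared with the other records. This is the purpose of indexing: it reduces the number of record pairs that have to be compared, at the risk of missing true matches that fall into different blocks.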