11.1 Record Pair One-To-One Assignment Restrictions

In many linkage projects one is often only interested in the best record pair matches, and one-to-one assignments need to be forced. The simplest way to do this would be to use a greedy algorithm, but this would result in some assignments being not optimal due to the transitive closure problem.

For example, assume a record is linked to a record with a match weight of , and record is also linked to a record with a weight of , and record is linked to record with a weight of . Assuming that the record pair list is sorted, a greedy assignment procedure will assign record to record (because the weight for pair is larger than for ), but then it can not assign record to record because record has already been assigned (with a lower weight than the to weight). Even if the record pair list is sorted according to the match weights the assignment will most likely not be optimal.

This linear sum assignment problem can be solved with special
algorithms that find the optimal solution over all possible
assignments. In **Febrl** we have implemented the *Auction*
algorithm (in the `lap.py` module) as developed by
Bertsekas [3] and others.

Our linear assignment procedure implementation takes as input the
results calculated from a classifier (i.e. record pairs and their
total weights), and optionally a threshold value (in which case all
record pairs that have a weight lower than this threshold are not
considered in the assignment procedure). A third argument is the
process type which can be either `deduplication`

or
`linkage`

. The reason for this is that in a deduplication process
a record number can only be part of one assignment (i.e. record
in two record pairs and is the same), while for a
linkage process a record number in the first data set does not
identify the same record as a record number in the second data set
(i.e. record pair and can both be assigned).

The first step in the assignment procedure (after filtering out all
record pairs with weights less than the given threshold) is to find
record numbers that are independent, i.e. that do only appear in
exactly one record pair, as they are not part of a multi-record
assignment problem. Secondly, we extract all sub-sets of connected
record pairs that build one assignment problem. Each of these sub-sets
can be solved independently, which is then done in the third step
using the *Auction* algorithm [3].

The output of the one-to-one assignment routine is a dictionary that contains the optimal record pair assignments and their corresponding matching weights.