In many linkage projects one is often only interested in the best record pair matches, and one-to-one assignments need to be forced. The simplest way to do this would be to use a greedy algorithm, but this would result in some assignments being not optimal due to the transitive closure problem.
For example, assume a record is linked to a record with a match weight of , and record is also linked to a record with a weight of , and record is linked to record with a weight of . Assuming that the record pair list is sorted, a greedy assignment procedure will assign record to record (because the weight for pair is larger than for ), but then it can not assign record to record because record has already been assigned (with a lower weight than the to weight). Even if the record pair list is sorted according to the match weights the assignment will most likely not be optimal.
This linear sum assignment problem can be solved with special algorithms that find the optimal solution over all possible assignments. In Febrl we have implemented the Auction algorithm (in the lap.py module) as developed by Bertsekas [3] and others.
Our linear assignment procedure implementation takes as input the
results calculated from a classifier (i.e. record pairs and their
total weights), and optionally a threshold value (in which case all
record pairs that have a weight lower than this threshold are not
considered in the assignment procedure). A third argument is the
process type which can be either deduplication
or
linkage
. The reason for this is that in a deduplication process
a record number can only be part of one assignment (i.e. record
in two record pairs and is the same), while for a
linkage process a record number in the first data set does not
identify the same record as a record number in the second data set
(i.e. record pair and can both be assigned).
The first step in the assignment procedure (after filtering out all record pairs with weights less than the given threshold) is to find record numbers that are independent, i.e. that do only appear in exactly one record pair, as they are not part of a multi-record assignment problem. Secondly, we extract all sub-sets of connected record pairs that build one assignment problem. Each of these sub-sets can be solved independently, which is then done in the third step using the Auction algorithm [3].
The output of the one-to-one assignment routine is a dictionary that contains the optimal record pair assignments and their corresponding matching weights.