This flexible classifier allows different methods to be used to
calculate the final matching weight for a weight vector (as calculated
by a RecordComparator, discussed in
Section 9.3). Similar to the Fellegi and
Sunter classifier, two thresholds are used to classify a record pair
into one of three classes: links, non-links or possible links. The
results of a classification are stored in a data structure, which can
then be used to produce various output forms as presented in
Instead of simply summing all weights in a weight vector, this classifier lets the user define how the final weight is calculated. The definition is a list of tuples, each containing a function and the elements of the weight vector to which that function is applied. The resulting intermediate weights are then combined into the final weight by another function, which also needs to be defined by the user.
The following functions can currently be used within the flexible classifier:
Weight vector elements are selected by giving the desired indexes
(starting from 0) in a Python list, e.g.
[0,1,4] selects the
first two and the fifth field comparison weights. When initialising a
flexible classifier, the argument
calculate needs to be set to
a list made of tuples with functions and weight vector elements as
shown in the example below.
The final weight can then be calculated by again using one of the
functions given above (e.g. 'avrg'). The argument
final_funct has to be
used for this when a flexible classifier is initialised.
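The two-stage calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the classifier's actual implementation: the helper names FUNCTIONS and final_weight are hypothetical, and only the function names 'add', 'max', 'min' and 'avrg' that appear in this section are covered.

```python
# Illustrative sketch only. FUNCTIONS and final_weight are hypothetical
# names; the function names are the ones that appear in this section.
FUNCTIONS = {
    'add': sum,
    'max': max,
    'min': min,
    'avrg': lambda values: sum(values) / len(values),
}


def final_weight(weight_vector, calculate, final_funct):
    """Apply each (function, indexes) tuple, then combine the results."""
    intermediate = []
    for funct_name, indexes in calculate:
        # Select the weight vector elements by index (starting from 0)
        selected = [weight_vector[i] for i in indexes]
        intermediate.append(FUNCTIONS[funct_name](selected))
    # Combine the intermediate weights with the final function
    return FUNCTIONS[final_funct](intermediate)


weights = [2.0, 3.0, -1.0, 0.5, 4.0]  # made-up comparison weights
calculate = [('add', [0, 1, 4]), ('max', [2, 3])]

# 'add' over elements 0, 1 and 4 gives 9.0, 'max' over elements 2 and 3
# gives 0.5, and their average is 4.75
print(final_weight(weights, calculate, 'avrg'))
```

Keeping the functions in a dictionary keyed by name mirrors the way they are given as strings in the calculate and final_funct arguments.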
Consider an example. Assume we have weight vectors that contain weights calculated by eight different field comparison functions (as explained in Section 9.2). We would like to calculate the final weight as the average of 1) the sum of the first four weights, 2) the maximum of weights five and six, and 3) the minimum of weights seven and eight. The corresponding flexible classifier can then be initialised as shown in the following example code.
# ====================================================================

flex_classifier = FlexibleClassifier(name = 'My flexible classifier',
                                     dataset_a = mydata_1,
                                     dataset_b = mydata_2,
                                     lower_threshold = 10.0,
                                     upper_threshold = 50.0,
                                     calculate = [('add', [0,1,2,3]),
                                                  ('max', [4,5]),
                                                  ('min', [6,7])],
                                     final_funct = 'avrg')
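To make the arithmetic behind this definition concrete, the following stand-alone sketch works through one weight vector by hand. The eight weight values are made up for illustration, and the treatment of weights that fall exactly on a threshold is an assumption, since it is not specified in this section.

```python
# Made-up weights for eight field comparisons (illustration only)
weight_vector = [12.0, 8.0, 15.0, 5.0, 9.0, 11.0, 3.0, 7.0]

added = sum(weight_vector[0:4])    # ('add', [0,1,2,3]) gives 40.0
maximum = max(weight_vector[4:6])  # ('max', [4,5])     gives 11.0
minimum = min(weight_vector[6:8])  # ('min', [6,7])     gives 3.0

# final_funct = 'avrg': average of the three intermediate weights
final_weight = (added + maximum + minimum) / 3.0  # 54.0 / 3.0 = 18.0

lower_threshold = 10.0
upper_threshold = 50.0

# Assumption: weights exactly on a threshold count as possible links
if final_weight > upper_threshold:
    status = 'link'
elif final_weight < lower_threshold:
    status = 'non-link'
else:
    status = 'possible link'

print(final_weight, status)
```

With these values the final weight lies between the two thresholds, so the record pair would be treated as a possible link.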
Note that a weight can be used in more than one of the
calculated intermediate weights, and that a weight can also be left
out altogether. It is important, though, that weight vectors have
at least as many elements as are referenced in the
calculate definitions (i.e.
definitions must not use indexes equal to or larger than the length
of the weight vectors).
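A check along these lines can catch an invalid definition before any record pairs are classified. It is a minimal sketch under the assumption that the length of the weight vectors is known in advance; the function name check_calculate is hypothetical and not part of the classifier.

```python
def check_calculate(calculate, vector_length):
    """Raise ValueError if any index lies outside the weight vector."""
    for funct_name, indexes in calculate:
        for i in indexes:
            if not 0 <= i < vector_length:
                raise ValueError(
                    "Index %d used with '%s' is outside a weight vector "
                    "of length %d" % (i, funct_name, vector_length))


# Weight 0 is used twice and weights 1 to 3 are not used at all;
# both are allowed
calculate = [('add', [0, 4, 5]), ('max', [0, 6]), ('min', [6, 7])]

check_calculate(calculate, 8)  # passes: the largest index used is 7

try:
    check_calculate(calculate, 7)  # index 7 needs at least 8 elements
except ValueError as err:
    print(err)
```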
When a flexible classifier is initialised, the following arguments need to be given.
dataset_b: The same data set as dataset_a in a deduplication process, but most likely a different data set in a linkage process (unless different parts of the same data set are to be linked).
calculate: A list of tuples, each made of a function name (e.g. 'avrg') and a list of the weight vector elements to be used (index numbers starting with 0).