This flexible classifier allows different methods to be used to calculate the final matching weight for a weight vector (as calculated by a RecordComparator, discussed in Section 9.3). Similar to the Fellegi and Sunter classifier, two thresholds are used to classify a record pair into one of three classes: links, non-links, or possible links. The results of a classification are stored in a data structure, which can then be used to produce the various output forms presented in Chapter 11.
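The two-threshold decision described above can be sketched in a few lines of Python. This is an illustration only, not the classifier's actual code: the function name classify_pair is hypothetical, and the handling of weights that fall exactly on a threshold (treated here as possible links) is an assumption.

```python
def classify_pair(final_weight, lower_threshold, upper_threshold):
    """Assign a record pair to one of three classes by its final weight.

    Hypothetical helper for illustration; weights exactly equal to a
    threshold are assumed to count as possible links.
    """
    if lower_threshold > upper_threshold:
        raise ValueError("lower_threshold must not exceed upper_threshold")
    if final_weight > upper_threshold:
        return "link"
    if final_weight < lower_threshold:
        return "non-link"
    return "possible link"


# Made-up final weights, classified with thresholds 10.0 and 50.0
for weight in (62.5, 31.0, 4.2):
    print(weight, classify_pair(weight, 10.0, 50.0))
```

Weights between the two thresholds are the ones that would typically be passed on for clerical review.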
Instead of simply summing all weights in a weight vector, this classifier lets the user define how the final weight is calculated. The definition consists of tuples, each containing a function and the elements of the weight vector to which that function is applied, so that each tuple yields one intermediate weight. The final weight is then calculated from these intermediate weights using another function chosen by the user.
The following functions can currently be used within the flexible classifier:
min
max
add
mult
avrg
Weight vector elements are selected by giving the desired indexes (starting from 0) in a Python list. For example, [0,1,4] selects the first two and the fifth field comparison weights. When initialising a flexible classifier, the argument calculate needs to be set to a list made of tuples with functions and weight vector elements as shown in the example below.
The final weight can then be calculated by again using one of the functions 'min', 'max', 'add', 'mult', or 'avrg' given above. The argument final_funct has to be used for this when a flexible classifier is initialised.
Consider an example. Assume we have weight vectors that contain weights calculated by eight different field comparison functions (as explained in Section 9.2). We would like to calculate the final weight as the average of 1) the sum of the first four weights, 2) the maximum of weights five and six, and 3) the minimum of weights seven and eight. The corresponding flexible classifier can then be initialised as shown in the following example code.
# ====================================================================

flex_classifier = FlexibleClassifier(name = 'My flexible classifier',
                                     dataset_a = mydata_1,
                                     dataset_b = mydata_2,
                                     lower_threshold = 10.0,
                                     upper_threshold = 50.0,
                                     calculate = [('add', [0,1,2,3]),
                                                  ('max', [4,5]),
                                                  ('min', [6,7])],
                                     final_funct = 'avrg')
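To make the two-stage calculation concrete, the following stand-alone sketch evaluates the calculate definition from the example on a made-up weight vector. It does not use the classifier itself: the FUNCTIONS table and the flexible_weight helper are hypothetical names introduced for illustration, and 'avrg' is assumed to mean the arithmetic mean.

```python
from functools import reduce
import operator

# Assumed meaning of the five function names ('avrg' = arithmetic mean)
FUNCTIONS = {
    'min': min,
    'max': max,
    'add': sum,
    'mult': lambda values: reduce(operator.mul, values, 1.0),
    'avrg': lambda values: sum(values) / len(values),
}


def flexible_weight(weight_vector, calculate, final_funct):
    """Hypothetical helper: combine a weight vector in two stages."""
    intermediate = []
    for funct_name, indexes in calculate:
        # Pick the requested weights, then reduce them to one value
        selected = [weight_vector[i] for i in indexes]
        intermediate.append(FUNCTIONS[funct_name](selected))
    # Combine the intermediate weights into the final weight
    return FUNCTIONS[final_funct](intermediate)


calculate = [('add', [0, 1, 2, 3]), ('max', [4, 5]), ('min', [6, 7])]
weight_vector = [5.0, 3.0, -2.0, 4.0, 6.0, 8.0, 0.0, 3.0]  # made-up weights

final_weight = flexible_weight(weight_vector, calculate, 'avrg')
print(final_weight)  # 6.0
```

Here the intermediate weights are 10.0 (sum of the first four), 8.0 (maximum of weights five and six) and 0.0 (minimum of weights seven and eight), so their average is 6.0.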
Note that a weight can be used in more than one of the calculated intermediate weights, and a weight can also be left out altogether. It is important, though, that weight vectors have at least as many elements as are referenced in the calculate definitions (i.e. the definitions must not use indexes beyond the end of the weight vectors).
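The index restriction can be checked before any weights are calculated. The sketch below is a hypothetical helper (check_calculate is not part of the classifier) that rejects definitions whose indexes fall outside a weight vector of a given length.

```python
def check_calculate(calculate, vector_length):
    """Raise ValueError if any index falls outside the weight vector.

    Hypothetical validation helper for illustration only.
    """
    for funct_name, indexes in calculate:
        for i in indexes:
            if not 0 <= i < vector_length:
                raise ValueError(
                    "index %d in %r definition is outside a weight "
                    "vector of length %d" % (i, funct_name, vector_length))


calculate = [('add', [0, 1, 2, 3]), ('max', [4, 5]), ('min', [6, 7])]

check_calculate(calculate, 8)      # eight weights: passes silently
try:
    check_calculate(calculate, 6)  # six weights: indexes 6 and 7 are missing
except ValueError as err:
    print(err)
```

Negative indexes are rejected as well, since Python would otherwise silently count them from the end of the list.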
When a flexible classifier is initialised, the following arguments need to be given.
name
dataset_a
dataset_b
The same as dataset_a in a deduplication process, but most likely a different data set in a linkage process (unless different parts of the same data set are to be linked).
lower_threshold
upper_threshold
calculate
A list of tuples, each made of a function name ('min', 'max', 'add', 'mult', or 'avrg') and a list of the weight vector elements to be used (index numbers starting with 0).
final_funct
The function used to calculate the final weight: 'min', 'max', 'add', 'mult', or 'avrg'.