9.5.2 Flexible Classifier

This flexible classifier allows different methods to be used to
calculate the final matching weight for a weight vector (as calculated
by a `RecordComparator`, as discussed in Section 9.3). Similar to the
Fellegi and Sunter classifier, two thresholds are used to classify a
record pair into one of the three classes: links, non-links or
possible links. The results of a classification are stored in a data
structure, which can then be used to produce various output forms as
presented in Chapter 11.
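The two-threshold decision can be sketched as follows. This is only an illustration, not the classifier's actual code: the function name `classify` is hypothetical, and the treatment of weights that are exactly equal to a threshold (here counted as possible links) is an assumption.

```python
def classify(final_weight, lower_threshold, upper_threshold):
    """Assign a record pair to one of three classes by its final weight.

    Assumption: weights exactly equal to a threshold are treated as
    possible links.
    """
    if upper_threshold <= lower_threshold:
        raise ValueError("upper_threshold must be larger than lower_threshold")
    if final_weight > upper_threshold:
        return "link"
    if final_weight < lower_threshold:
        return "non-link"
    return "possible link"


if __name__ == "__main__":
    for weight in (60.0, 25.0, 3.0):
        print(weight, classify(weight, 10.0, 50.0))
```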

Instead of simply summing all weights in a weight vector, this flexible classifier calculates the final weight in two steps. First, intermediate weights are calculated from tuples, each containing a function and the elements of the weight vector to which the function is applied. The final weight is then calculated from these intermediate weights using another function that is selected by the user.

The following functions can currently be used within the flexible classifier:

`min`

Take the minimum value in the selected weight vector elements.

`max`

Take the maximum value in the selected weight vector elements.

`add`

Add the values in the selected weight vector elements.

`mult`

Multiply the values in the selected weight vector elements.

`avrg`

Calculate the average of the values in the selected weight vector elements.

Weight vector elements are selected by giving the desired indexes
(starting from 0) in a Python list. For example, `[0,1,4]` selects the
first two and the fifth field comparison weights. When initialising a
flexible classifier, the argument `calculate` needs to be set to
a list made of tuples with functions and weight vector elements as
shown in the example below.

The final weight can then be calculated by again using one of the
functions `'min'`, `'max'`, `'add'`, `'mult'`, or `'avrg'` given
above. The argument `final_funct` has to be used for this when a
flexible classifier is initialised.
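The two-step calculation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the classifier's own implementation: the names `FUNCTIONS` and `flexible_weight` are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical mapping from the function names used by the classifier
# to plain Python callables that take a list of numbers.
FUNCTIONS = {
    "min": min,
    "max": max,
    "add": sum,
    "mult": math.prod,
    "avrg": lambda values: sum(values) / len(values),
}


def flexible_weight(weight_vector, calculate, final_funct):
    """Compute a final weight from a weight vector in two steps."""
    intermediate = []
    for funct_name, indexes in calculate:
        # Step 1: apply each function to its selected vector elements.
        selected = [weight_vector[i] for i in indexes]
        intermediate.append(FUNCTIONS[funct_name](selected))
    # Step 2: combine the intermediate weights with the final function.
    return FUNCTIONS[final_funct](intermediate)


if __name__ == "__main__":
    vector = [1.0, 2.0, 3.0, 4.0]
    print(flexible_weight(vector, [("add", [0, 1]), ("mult", [2, 3])], "max"))
```

A dictionary lookup keeps the function names as strings, matching the way they are passed in the `calculate` and `final_funct` arguments.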

Consider the following example. Assume we have weight vectors that
contain weights calculated by eight different field comparison
functions (as explained in Section 9.2). We would like to calculate
the final weight as the *average* of 1) the *sum* of the first four
weights, 2) the *maximum* of weights five and six, and 3) the
*minimum* of weights seven and eight. The corresponding flexible
classifier can then be initialised as shown in the following example
code.

    # ====================================================================

    flex_classifier = FlexibleClassifier(name = 'My flexible classifier',
                                         dataset_a = mydata_1,
                                         dataset_b = mydata_2,
                                         lower_threshold = 10.0,
                                         upper_threshold = 50.0,
                                         calculate = [('add', [0,1,2,3]),
                                                      ('max', [4,5]),
                                                      ('min', [6,7])],
                                         final_funct = 'avrg')
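To make the arithmetic of this configuration concrete, the following sketch works through one weight vector by hand using only Python built-ins. The weight values are invented for illustration, and the handling of the thresholds (strict comparisons) is an assumption rather than documented behaviour.

```python
# Hypothetical weight vector holding eight field comparison weights.
weight_vector = [12.0, 8.5, -3.0, 10.5, 4.0, 9.0, 2.0, 6.0]

# Intermediate weights, mirroring the 'calculate' argument above.
sum_first_four = sum(weight_vector[0:4])   # ('add', [0,1,2,3]) -> 28.0
max_five_six = max(weight_vector[4:6])     # ('max', [4,5])     -> 9.0
min_seven_eight = min(weight_vector[6:8])  # ('min', [6,7])     -> 2.0

# Final weight, mirroring final_funct = 'avrg'.
intermediate = [sum_first_four, max_five_six, min_seven_eight]
final_weight = sum(intermediate) / len(intermediate)  # 39.0 / 3 = 13.0

# Compare against the two thresholds from the example.
lower_threshold, upper_threshold = 10.0, 50.0
if final_weight > upper_threshold:
    status = "link"
elif final_weight < lower_threshold:
    status = "non-link"
else:
    status = "possible link"

print(intermediate, final_weight, status)
# prints: [28.0, 9.0, 2.0] 13.0 possible link
```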

Note that it is possible to use a weight in more than one of the
calculated intermediate weights. It is also possible to leave a weight
unused. It is important, though, that weight vectors have at least as
many elements as are referenced in the `calculate` definitions (i.e.
one should not use definitions with indexes that are equal to or
larger than the length of the weight vectors).
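A check along the following lines could be run before classification to catch out-of-range indexes early. It is a sketch only: the helper name `check_calculate` is hypothetical and not part of the classifier, and returning the unused indexes is simply a convenient way to show that leaving weights out is permitted.

```python
def check_calculate(calculate, vector_length):
    """Validate index lists against the length of the weight vectors.

    Raises ValueError for an index outside 0 .. vector_length - 1 and
    returns a sorted list of the indexes that no definition uses.
    """
    used = set()
    for funct_name, indexes in calculate:
        for i in indexes:
            if not 0 <= i < vector_length:
                raise ValueError(
                    "index %d in '%s' definition is outside a weight "
                    "vector of length %d" % (i, funct_name, vector_length)
                )
            used.add(i)
    return sorted(set(range(vector_length)) - used)


if __name__ == "__main__":
    definitions = [("add", [0, 1, 2, 3]), ("max", [4, 5]), ("min", [6, 7])]
    print(check_calculate(definitions, 9))  # index 8 is never used
```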

When a flexible classifier is initialised, the following arguments need to be given.

`name`

A name for the classifier. This should be a short string.

`dataset_a`

A reference to a data set. This must be the same data set as the first data set defined within a record comparator.

`dataset_b`

A reference to a data set, which must be the same as the second data set defined within a record comparator. This data set will be the same as `dataset_a` in a deduplication process, but most likely a different data set in a linkage process (unless different parts of the same data set are to be linked).

`lower_threshold`

A number, which is the lower threshold for the classifier.

`upper_threshold`

A number, which is the upper threshold for the classifier. It must be larger than the lower threshold.

`calculate`

The definitions for the calculation of intermediate results using selected elements of the weight vector. This must be a list containing tuples, with each tuple being made of a function (one of `'min'`, `'max'`, `'add'`, `'mult'`, or `'avrg'`) and a list of the weight vector elements to be used (index numbers starting with 0).

`final_funct`

The function to be used to calculate the final weight. Must be one of `'min'`, `'max'`, `'add'`, `'mult'`, or `'avrg'`.