Two methods (routines) are available within a project object to define and
start a linkage or a deduplication process, respectively. Assuming that a
project object has been created (by copying one of the provided template
modules, project-linkage.py or project-deduplicate.py - see Chapter 5 for
more details), and that component and record standardisers, field and
record comparators, indexes and a classifier have all been defined, a data
set can be deduplicated or two data sets can be linked with one simple call
to the corresponding method, as shown in the following examples.
# ====================================================================

myproject.deduplicate(input_dataset = hospital_data,
                      tmp_dataset = tmpdata,
                      rec_standardiser = hospital_standardiser,
                      rec_comparator = hospital_comparator,
                      blocking_index = hospital_index,
                      classifier = hospital_fell_sunter_classifier,
                      first_record = 0,
                      number_records = 100000,
                      weight_vector_file = 'dedup-example-weight-vecs.csv',
                      weight_vector_rec_field = 'rec_id',
                      output_histogram = True,
                      output_rec_pair_details = 'hosp-dedupl-details.txt',
                      output_rec_pair_weights = 'hosp-dedupl-weights.csv',
                      output_threshold = 30.0,
                      output_assignment = 'one2one')

# ====================================================================

myproject.link(input_dataset_a = hospital_data,
               tmp_dataset_a = tmpdata1,
               input_dataset_b = accident_data,
               tmp_dataset_b = tmpdata2,
               rec_standardiser_a = hospital_standardiser,
               rec_standardiser_b = accident_standardiser,
               rec_comparator = hosp_acc_comparator,
               blocking_index_a = hospital_index,
               blocking_index_b = accident_index,
               classifier = hosp_acc_fell_sunter_classifier,
               first_record_a = 0,
               number_records_a = 10000,
               first_record_b = 0,
               number_records_b = 20000,
               weight_vector_file = 'link-example-weight-vecs.csv',
               weight_vector_rec_fields = ['rec_id','rec_id'],
               output_histogram = True,
               output_rec_pair_details = True,
               output_rec_pair_weights = 'hosp-linkage-weights.csv',
               output_threshold = 40.0,
               output_assignment = 'one2one')
In the first example, the first 100,000 records (record numbers 0 to 99,999) of a fictitious hospital data set are deduplicated, and in the second example the first 10,000 records (numbers 0 to 9,999) of a hospital data set are linked with the first 20,000 records (numbers 0 to 19,999) of an accident data set.
For the deduplication method, the following arguments need to be defined.
input_dataset
  A reference to the input data set, which must be initialised in read
  access mode.

tmp_dataset
  A reference to a temporary data set (initialised in access mode
  readwrite) that will hold the cleaned and standardised records before
  they are deduplicated.

rec_standardiser
  A reference to a record standardiser. If set to None (i.e.
  rec_standardiser = None) then the records from the input data set will
  be used directly in the deduplication process (they will be directly
  copied into the temporary data set; in such a case, the temporary data
  set must have the same field name definitions as the input data set).
  See Chapter 6 and Section 6.9 for more details on how to initialise a
  record standardiser.

rec_comparator
  A reference to the record comparator to be used for comparing record
  pairs.

blocking_index
  A reference to the (blocking) index to be used.

classifier
  A reference to the classifier to be used to classify the compared
  record pairs (a Fellegi and Sunter classifier in the examples above).

first_record
  The number of the first record to be processed. If set to None, then it
  will be set to the first record number (i.e. the record with number 0).

number_records
  The number of records to be processed. If set to None, it will be set
  to the total number of records in the data set.

weight_vector_file
  If set to a file name, the weight vectors (together with the record
  identifiers defined by weight_vector_rec_field or
  weight_vector_rec_fields) will be saved into a CSV (comma separated
  values) text file. An existing file with the given name will be erased
  first. If set to None, no weight vector file will be written. A header
  line will be written with the column names being the names of the field
  comparison functions (see Section 9.2 for more details).

weight_vector_rec_field / weight_vector_rec_fields
  Can be None (the default), in which case the first two columns in the
  weight vector file (if defined) will be the (internal) record numbers
  for the two records being compared (resulting in a weight vector).
  Alternatively, this argument can be set to the name of a field in the
  input data set (for a deduplication project), or to a list with two
  field names - one from each of the two input data sets (for a linkage
  project). In these cases the first two columns in the weight vector
  file will contain the corresponding field values of the records being
  compared. A short sketch of reading such a weight vector file back is
  given after this argument list.
output_histogram
  If set to True (as in the examples above), a histogram of the computed
  matching weights will be output.

output_rec_pair_details
  Controls the output of details for the classified record pairs. As the
  examples above show, this argument can be set to a file name (the
  details will be written into that file) or to True.

output_rec_pair_weights
  If set to a file name (as in the examples above), the compared record
  pairs together with their matching weights will be written into that
  file.

output_threshold
  A threshold on the matching weight: only record pairs with a matching
  weight equal to or larger than this threshold will be included in the
  output (set to 30.0 and 40.0 in the examples above).

output_assignment
  If set to 'one2one' (as in the examples above), a one-to-one assignment
  procedure will be applied to the classified record pairs, so that each
  record is assigned to at most one other record.
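The following is a minimal sketch (not part of Febrl itself) of how the weight vector file written by the deduplication example above could be read back with Python's standard csv module. The file name and the two leading 'rec_id' columns follow that example call; the number and order of the weight columns depend on the field comparison functions defined in the project, so the column handling here is an assumption.

import csv

# Hypothetical post-processing of the weight vector file written by the
# deduplication example above ('dedup-example-weight-vecs.csv').
with open('dedup-example-weight-vecs.csv') as f:
    reader = csv.reader(f)
    header = next(reader)  # column names (field comparison functions)
    for row in reader:
        rec_id_a, rec_id_b = row[0], row[1]    # 'rec_id' values of the compared records
        weights = [float(w) for w in row[2:]]  # one weight per field comparison
        total_weight = sum(weights)            # summed matching weight (assumption)
        print(rec_id_a, rec_id_b, total_weight)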
For a record linkage process, similar arguments are needed. The main differences are basically that references to two input and two temporary data sets must be given, then references to two record standardisers and two indexing definitions (one per data set), plus first record and number of records for two data sets.
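As an illustration only, the following sketch shows a linkage call in which both input data sets are assumed to be already cleaned and standardised, so both record standardisers are set to None (assuming the None behaviour described above for the deduplication method carries over to linkage). All object names (clean_data_a, index_a, etc.) are hypothetical placeholders, not objects defined in the examples above.

myproject.link(input_dataset_a = clean_data_a,      # assumed already standardised
               tmp_dataset_a = tmpdata_a,
               input_dataset_b = clean_data_b,      # assumed already standardised
               tmp_dataset_b = tmpdata_b,
               rec_standardiser_a = None,           # copy records directly into tmpdata_a
               rec_standardiser_b = None,           # copy records directly into tmpdata_b
               rec_comparator = clean_comparator,
               blocking_index_a = index_a,
               blocking_index_b = index_b,
               classifier = fell_sunter_classifier,
               first_record_a = None,               # start at record number 0
               number_records_a = None,             # process all records in data set A
               first_record_b = None,
               number_records_b = None,
               weight_vector_file = None,           # do not write a weight vector file
               output_histogram = True,
               output_rec_pair_details = True,
               output_rec_pair_weights = 'clean-linkage-weights.csv',
               output_threshold = 40.0,
               output_assignment = 'one2one')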