9.6 Starting a Linkage or Deduplication Process

A project object provides two methods (routines) to define and start a process: deduplicate() for deduplicating a single data set and link() for linking two data sets.

Assuming that a project object has been created (by copying one of the provided template modules, project-linkage.py or project-deduplicate.py; see Chapter 5 for more details) and that component and record standardisers, field and record comparators, indexes and a classifier have all been defined, a data set can be deduplicated, or two data sets linked, with a single call to the corresponding method, as shown in the following examples.

# ====================================================================

myproject.deduplicate(input_dataset = hospital_data,
                        tmp_dataset = tmpdata,
                   rec_standardiser = hospital_standardiser,
                     rec_comparator = hospital_comparator,
                     blocking_index = hospital_index,
                         classifier = hospital_fell_sunter_classifier,
                       first_record = 0,
                     number_records = 100000,
                 weight_vector_file = 'dedup-example-weight-vecs.csv',
            weight_vector_rec_field = 'rec_id',
                   output_histogram = True,
            output_rec_pair_details = 'hosp-dedupl-details.txt',
            output_rec_pair_weights = 'hosp-dedupl-weights.csv',
                   output_threshold = 30.0,
                  output_assignment = 'one2one')

# ====================================================================

myproject.link(input_dataset_a = hospital_data,
                 tmp_dataset_a = tmpdata1,
               input_dataset_b = accident_data,
                 tmp_dataset_b = tmpdata2,
            rec_standardiser_a = hospital_standardiser,
            rec_standardiser_b = accident_standardiser,
                rec_comparator = hosp_acc_comparator,
              blocking_index_a = hospital_index,
              blocking_index_b = accident_index,
                    classifier = hosp_acc_fell_sunter_classifier,
                first_record_a = 0,
              number_records_a = 10000,
                first_record_b = 0,
              number_records_b = 20000,
            weight_vector_file = 'link-example-weight-vecs.csv',
      weight_vector_rec_fields = ['rec_id','rec_id'],
              output_histogram = True,
       output_rec_pair_details = True,
       output_rec_pair_weights = 'hosp-linkage-weights.csv',
              output_threshold = 40.0,
             output_assignment = 'one2one')

In the first example, 100,000 records of a fictitious hospital data set are deduplicated, starting at record number 0. In the second example, 10,000 records of a hospital data set are linked with 20,000 records of an accident data set, in both cases starting at record number 0.
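The record ranges in both examples follow from the first_record and number_records arguments. The following is a minimal sketch of that arithmetic, assuming zero-based record numbers and that number_records is a count rather than an end index (neither is confirmed in this section). The helper record_range is hypothetical and is not part of the project object.

```python
def record_range(first_record, number_records):
    """Return the (first, last) record numbers covered, both inclusive.

    Hypothetical helper for illustration only. Assumes zero-based record
    numbers and that number_records is a count of records.
    """
    if first_record < 0 or number_records < 1:
        raise ValueError('need first_record >= 0 and number_records >= 1')
    return (first_record, first_record + number_records - 1)


# Values taken from the two example calls above
dedup_range = record_range(0, 100000)   # hospital data (deduplication)
link_range_a = record_range(0, 10000)   # hospital data (linkage, data set A)
link_range_b = record_range(0, 20000)   # accident data (linkage, data set B)
```

Under these assumptions the deduplication example covers record numbers 0 to 99,999, and the linkage example covers record numbers 0 to 9,999 and 0 to 19,999.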

For the deduplication method, the following arguments need to be defined.

For a record linkage process, similar arguments are needed. The main differences are that references to two input data sets and two temporary data sets must be given, together with two record standardisers and two indexing definitions (one per data set), and a first record and a number of records for each data set.
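The correspondence between the two argument lists can be sketched as follows. The argument names are taken from the two example calls above. The helper linkage_arg_names is hypothetical, written only to illustrate the naming pattern, and is not part of the project object.

```python
# Deduplication arguments that are given once per data set in a linkage
PER_DATASET = ['input_dataset', 'tmp_dataset', 'rec_standardiser',
               'blocking_index', 'first_record', 'number_records']


def linkage_arg_names(dedup_names):
    """Expand deduplication argument names into their linkage counterparts.

    Hypothetical helper for illustration only.
    """
    names = []
    for name in dedup_names:
        if name in PER_DATASET:
            # One argument per data set, with suffix '_a' or '_b'
            names.extend([name + '_a', name + '_b'])
        elif name == 'weight_vector_rec_field':
            # In the linkage example this becomes a list, one field per data set
            names.append('weight_vector_rec_fields')
        else:
            # Shared arguments (comparator, classifier, output options)
            names.append(name)
    return names


dedup_names = ['input_dataset', 'tmp_dataset', 'rec_standardiser',
               'rec_comparator', 'blocking_index', 'classifier',
               'first_record', 'number_records', 'weight_vector_file',
               'weight_vector_rec_field', 'output_histogram',
               'output_rec_pair_details', 'output_rec_pair_weights',
               'output_threshold', 'output_assignment']

link_names = linkage_arg_names(dedup_names)
```

The record comparator, the classifier and the output options keep the same names in both calls; only the arguments that describe a data set are doubled.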