Two methods (routines) to define and start a linkage or a
deduplication process are available within a project object.
Assuming that a
project object has been created (by copying
either the template module project-linkage.py or
project-deduplicate.py provided - see
Chapter 5 for more details) and component and
record standardisers, field and record comparators, indexes and a
classifier have all been defined, deduplication of a data set or
linkage of two data sets can be done by one simple call to the
corresponding method as shown in the following examples.
# ====================================================================

myproject.deduplicate(input_dataset = hospital_data,
                      tmp_dataset = tmpdata,
                      rec_standardiser = hospital_standardiser,
                      rec_comparator = hospital_comparator,
                      blocking_index = hospital_index,
                      classifier = hospital_fell_sunter_classifier,
                      first_record = 0,
                      number_records = 100000,
                      weight_vector_file = 'dedup-example-weight-vecs.csv',
                      weight_vector_rec_field = 'rec_id',
                      output_histogram = True,
                      output_rec_pair_details = 'hosp-dedupl-details.txt',
                      output_rec_pair_weights = 'hosp-dedupl-weights.csv',
                      output_threshold = 30.0,
                      output_assignment = 'one2one')

# ====================================================================

myproject.link(input_dataset_a = hospital_data,
               tmp_dataset_a = tmpdata1,
               input_dataset_b = accident_data,
               tmp_dataset_b = tmpdata2,
               rec_standardiser_a = hospital_standardiser,
               rec_standardiser_b = accident_standardiser,
               rec_comparator = hosp_acc_comparator,
               blocking_index_a = hospital_index,
               blocking_index_b = accident_index,
               classifier = hosp_acc_fell_sunter_classifier,
               first_record_a = 0,
               number_records_a = 10000,
               first_record_b = 0,
               number_records_b = 20000,
               weight_vector_file = 'link-example-weight-vecs.csv',
               weight_vector_rec_fields = ['rec_id','rec_id'],
               output_histogram = True,
               output_rec_pair_details = True,
               output_rec_pair_weights = 'hosp-linkage-weights.csv',
               output_threshold = 40.0,
               output_assignment = 'one2one')
In the first example, the first 100,000 records (numbers 0 to 99,999) of a fictitious hospital data set are deduplicated. In the second example, the first 10,000 records (numbers 0 to 9,999) of a hospital data set are linked with the first 20,000 records (numbers 0 to 19,999) of an accident data set.
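The record ranges in both calls are given as a start number plus a count. The following is a minimal sketch of how such a pair could be turned into concrete record numbers, assuming zero-based numbering and a half-open range. The helper resolve_record_range and its total_records parameter are hypothetical and not part of the project module, and clipping the range at the end of the data set is an assumption made for this illustration.

```python
def resolve_record_range(first_record, number_records, total_records):
    """Return the record numbers selected by a (first_record, number_records) pair.

    Hypothetical helper for illustration only; it is not part of the
    project module described in this chapter.
    """
    # Assumption: records are numbered from 0, so a missing start means record 0.
    if first_record is None:
        first_record = 0
    # A missing count falls back to the total number of records in the data set.
    if number_records is None:
        number_records = total_records
    # Assumption: a range that runs past the end of the data set is clipped.
    end_exclusive = min(first_record + number_records, total_records)
    return range(first_record, end_exclusive)


# first_record = 0 and number_records = 100000 select 100,000 records.
selected = resolve_record_range(0, 100000, 250000)
print(selected[0], selected[-1], len(selected))
```

Under these assumptions a count of 100000 starting at record 0 ends at record 99,999, which is why the start and count are best read as "first N records" rather than as an inclusive pair of record numbers.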
For the deduplication method, the following arguments need to be defined.
tmp_dataset: a reference to a temporary data set (readwrite) that will hold the cleaned and standardised records before they are deduplicated.
rec_standardiser: a reference to a record standardiser. If no record standardiser is given (rec_standardiser = None), the records from the input data set will be used directly in the deduplication process: they will be copied unchanged into the temporary data set, which in this case must have the same field name definitions as the input data set. See Chapter 6 and Section 6.9 for more details on how to initialise a record standardiser.
first_record: the number of the first record to be processed. If not given (None), it will be set to the first record number in the data set (i.e. the record with number 0).
number_records: the number of records to be processed. If not given (None), it will be set to the total number of records in the data set.
weight_vector_file: if set to a file name, the weight vectors (with the record identifiers selected by weight_vector_rec_fields) will be saved into a CSV (comma separated values) text file. An existing file with the given name will be erased first. If set to None, no weight vector file will be written. A header line will be written with the column names being the names of the field comparison functions (see Section 9.2 for more details).
The record field argument (weight_vector_rec_field in the deduplication example, weight_vector_rec_fields in the linkage example) can be set to None (the default), in which case the first two columns in the weight vector file (if defined) will be the (internal) record numbers of the two records being compared (the comparison that results in a weight vector). Alternatively, this argument can be set to the name of a field in the input data set (for deduplication) or to a list with two field names, one from each of the two input data sets (for a linkage project). In these cases the first two columns in the weight vector file will contain the corresponding field values of the records being compared.
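The layout of the weight vector file described above can be sketched as follows. This is an illustration only and not the project module's own code: the function write_weight_vectors, the header names rec_a and rec_b for the two identifier columns, and the sample comparison names are all assumptions. It shows the two identifier modes, internal record numbers by default or the values of a named field.

```python
import csv
import io


def write_weight_vectors(out, comparison_names, weight_vectors,
                         records=None, rec_field=None):
    """Write weight vectors as CSV rows to the file-like object `out`.

    Hypothetical sketch. `weight_vectors` maps a pair of internal record
    numbers to the list of weights computed for that pair.
    """
    writer = csv.writer(out)
    # Header: two identifier columns (names assumed here), then one column
    # per field comparison function.
    writer.writerow(["rec_a", "rec_b"] + list(comparison_names))
    for (num_a, num_b), weights in sorted(weight_vectors.items()):
        if rec_field is None:
            # Default: identify the pair by internal record numbers.
            id_a, id_b = num_a, num_b
        else:
            # Otherwise use the value of the named field from each record.
            id_a = records[num_a][rec_field]
            id_b = records[num_b][rec_field]
        writer.writerow([id_a, id_b] + list(weights))


# Made-up sample data for the illustration.
records = {0: {"rec_id": "H-0001"}, 1: {"rec_id": "H-0002"}}
vectors = {(0, 1): [4.2, -1.5]}

by_number = io.StringIO()
write_weight_vectors(by_number, ["surname_cmp", "postcode_cmp"], vectors)

by_field = io.StringIO()
write_weight_vectors(by_field, ["surname_cmp", "postcode_cmp"], vectors,
                     records=records, rec_field="rec_id")
print(by_number.getvalue())
print(by_field.getvalue())
```

When writing to disk instead of an in-memory buffer, opening the target with open(name, "w", newline="") truncates any existing file of that name, which matches the "erased first" behaviour described for the weight vector file.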
For a record linkage process, similar arguments are needed. The main differences are that references to two input data sets and two temporary data sets must be given, together with two record standardisers and two indexing definitions (one per data set), and a first record and a number of records for each of the two data sets. In the linkage example above, these paired arguments carry the suffixes _a and _b.