Two methods (routines) to define and start a linkage or a
deduplication process are available within a project
object: link and deduplicate,
respectively.
Assuming that a project
object has been created (by copying
either the template module project-linkage.py or
project-deduplicate.py provided - see
Chapter 5 for more details) and component and
record standardisers, field and record comparators, indexes and a
classifier have all been defined, deduplication of a data set or
linkage of two data sets can be done by one simple call to the
corresponding method as shown in the following examples.
# ====================================================================

myproject.deduplicate(input_dataset = hospital_data,
                      tmp_dataset = tmpdata,
                      rec_standardiser = hospital_standardiser,
                      rec_comparator = hospital_comparator,
                      blocking_index = hospital_index,
                      classifier = hospital_fell_sunter_classifier,
                      first_record = 0,
                      number_records = 100000,
                      output_histogram = True,
                      output_rec_pair_details = 'hosp-dedupl-details.txt',
                      output_rec_pair_weights = 'hosp-dedupl-weights.txt',
                      output_threshold = 30.0,
                      output_assignment = 'one2one')

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

myproject.link(input_dataset_a = hospital_data,
               tmp_dataset_a = tmpdata1,
               input_dataset_b = accident_data,
               tmp_dataset_b = tmpdata2,
               rec_standardiser_a = hospital_standardiser,
               rec_standardiser_b = accident_standardiser,
               rec_comparator = hosp_acc_comparator,
               blocking_index_a = hospital_index,
               blocking_index_b = accident_index,
               classifier = hosp_acc_fell_sunter_classifier,
               first_record_a = 0,
               number_records_a = 10000,
               first_record_b = 0,
               number_records_b = 20000,
               output_histogram = True,
               output_rec_pair_details = True,
               output_rec_pair_weights = 'hosp-linkage-weights.txt',
               output_threshold = 40.0,
               output_assignment = 'one2one')
In the first example, the first 100,000 records (record numbers 0 to 99,999) in a fictitious hospital data set are deduplicated. In the second example, records 0 to 9,999 in a hospital data set are linked with records 0 to 19,999 in an accident data set.
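The record ranges in these examples follow from simple arithmetic if number_records is read as a count rather than an end index, which is what the argument names suggest. The helper below is a hypothetical illustration and not part of the project object; it only shows which record numbers a given pair of values would cover under that reading.

```python
def record_span(first_record, number_records):
    """Return the (first, last) record numbers covered, both inclusive.

    Hypothetical helper: assumes record numbers start at 0 and that
    number_records is a count of records, not the last record number.
    """
    if number_records < 1:
        raise ValueError("number_records must be at least 1")
    return first_record, first_record + number_records - 1


# Values taken from the deduplication and linkage examples.
dedup_span = record_span(0, 100000)
link_span_a = record_span(0, 10000)
link_span_b = record_span(0, 20000)

print(dedup_span)
print(link_span_a)
print(link_span_b)
```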
For the deduplication method, the following arguments need to be defined.
input_dataset
  A reference to the input data set, which must be initialised in read access mode.
tmp_dataset
  A reference to a temporary data set (initialised in readwrite access mode) that will hold the cleaned and standardised records before they are deduplicated.
rec_standardiser
  A reference to a record standardiser. If this argument is set to None (i.e. rec_standardiser = None), the records from the input data set will be used directly in the deduplication process: they will be copied unchanged into the temporary data set, which in this case must have the same field name definitions as the input data set. See Chapter 6 and Section 6.8 for more details on how to initialise a record standardiser.
rec_comparator
blocking_index
classifier
first_record
  The number of the first record to be processed. If set to None, it will be set to the first record number (i.e. the record with number 0).
number_records
  The number of records to be processed. If set to None, it will be set to the total number of records in the data set.
output_histogram
output_rec_pair_details
output_rec_pair_weights
output_threshold
output_assignment
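Two of the rules in this list can be checked before a run is started: the None defaults for first_record and number_records, and the requirement that the temporary data set has the same field names as the input data set when rec_standardiser is None. The following is a minimal sketch of such a check. Both function names are hypothetical, and the total record count and the field name lists are passed in as plain values, since how a data set object reports them is not described here.

```python
def resolve_record_range(first_record, number_records, total_records):
    """Apply the None defaults for first_record and number_records.

    total_records is an assumed input: the number of records in the
    data set, however the data set object happens to report it.
    """
    if first_record is None:
        first_record = 0                # first record has number 0
    if number_records is None:
        number_records = total_records  # process the whole data set
    return first_record, number_records


def check_direct_copy(input_fields, tmp_fields):
    """Check the field names for the rec_standardiser = None case.

    Records are then copied unchanged into the temporary data set, so
    both data sets must define the same field names.
    """
    only_in_input = sorted(set(input_fields) - set(tmp_fields))
    only_in_tmp = sorted(set(tmp_fields) - set(input_fields))
    if only_in_input or only_in_tmp:
        raise ValueError(
            "Field names differ: only in input %s, only in temporary %s"
            % (only_in_input, only_in_tmp)
        )


# Example values (made up for illustration).
first, number = resolve_record_range(None, None, 250000)
print(first, number)

check_direct_copy(["surname", "postcode"], ["postcode", "surname"])
```

The second call succeeds silently because the order of the field names is ignored; a mismatch in either direction raises a ValueError that names the offending fields.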
For a record linkage process, similar arguments are needed. The main difference is that the arguments describing a single data set are given twice, with the suffixes _a and _b: references to two input and two temporary data sets, to two record standardisers and to two blocking indexes (one per data set), plus the first record and the number of records for each data set. The record comparator, the classifier and the output arguments are given once.
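The relationship between the two argument lists can be sketched as a small helper that builds the keyword arguments for a linkage from two per-data-set dictionaries plus the shared settings. The helper link_arguments and the constant PER_DATASET are hypothetical; the argument names are taken from the linkage example above, and plain strings stand in for the real data set, standardiser, index, comparator and classifier objects.

```python
# Arguments that are given once per data set (suffix _a or _b).
PER_DATASET = (
    "input_dataset",
    "tmp_dataset",
    "rec_standardiser",
    "blocking_index",
    "first_record",
    "number_records",
)


def link_arguments(dataset_a, dataset_b, shared):
    """Build a keyword-argument dictionary for a linkage call.

    dataset_a and dataset_b map each name in PER_DATASET to a value;
    shared holds the arguments that are given only once.
    """
    kwargs = dict(shared)
    for suffix, settings in (("_a", dataset_a), ("_b", dataset_b)):
        for name in PER_DATASET:
            kwargs[name + suffix] = settings[name]
    return kwargs


# Strings are placeholders for the objects defined in the project module.
hospital = {
    "input_dataset": "hospital_data",
    "tmp_dataset": "tmpdata1",
    "rec_standardiser": "hospital_standardiser",
    "blocking_index": "hospital_index",
    "first_record": 0,
    "number_records": 10000,
}
accident = {
    "input_dataset": "accident_data",
    "tmp_dataset": "tmpdata2",
    "rec_standardiser": "accident_standardiser",
    "blocking_index": "accident_index",
    "first_record": 0,
    "number_records": 20000,
}
shared = {
    "rec_comparator": "hosp_acc_comparator",
    "classifier": "hosp_acc_fell_sunter_classifier",
    "output_threshold": 40.0,
    "output_assignment": "one2one",
}

kwargs = link_arguments(hospital, accident, shared)
for name in sorted(kwargs):
    print(name, "=", kwargs[name])
```

With real objects in place of the strings, the resulting dictionary could be passed on as myproject.link(**kwargs).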