Two methods (routines) are available within a project object to define and
start a linkage or a deduplication process, respectively. Assuming that a
project object has been created (by copying one of the provided template
modules, project-linkage.py or project-deduplicate.py - see Chapter 5 for
more details), and that component and record standardisers, field and
record comparators, indexes and a classifier have all been defined, a data
set can be deduplicated or two data sets can be linked with one simple call
to the corresponding method, as shown in the following examples.
# ====================================================================

myproject.deduplicate(input_dataset = hospital_data,
                      tmp_dataset = tmpdata,
                      rec_standardiser = hospital_standardiser,
                      rec_comparator = hospital_comparator,
                      blocking_index = hospital_index,
                      classifier = hospital_fell_sunter_classifier,
                      first_record = 0,
                      number_records = 100000,
                      weight_vector_file = 'dedup-example-weight-vecs.csv',
                      weight_vector_rec_field = 'rec_id',
                      output_histogram = True,
                      output_rec_pair_details = 'hosp-dedupl-details.txt',
                      output_rec_pair_weights = 'hosp-dedupl-weights.csv',
                      output_threshold = 30.0,
                      output_assignment = 'one2one')

# ====================================================================

myproject.link(input_dataset_a = hospital_data,
               tmp_dataset_a = tmpdata1,
               input_dataset_b = accident_data,
               tmp_dataset_b = tmpdata2,
               rec_standardiser_a = hospital_standardiser,
               rec_standardiser_b = accident_standardiser,
               rec_comparator = hosp_acc_comparator,
               blocking_index_a = hospital_index,
               blocking_index_b = accident_index,
               classifier = hosp_acc_fell_sunter_classifier,
               first_record_a = 0,
               number_records_a = 10000,
               first_record_b = 0,
               number_records_b = 20000,
               weight_vector_file = 'link-example-weight-vecs.csv',
               weight_vector_rec_fields = ['rec_id','rec_id'],
               output_histogram = True,
               output_rec_pair_details = True,
               output_rec_pair_weights = 'hosp-linkage-weights.csv',
               output_threshold = 40.0,
               output_assignment = 'one2one')
In the first example, the first 100,000 records (record numbers 0 to 99,999) of a fictitious hospital data set are deduplicated, and in the second example the first 10,000 records (numbers 0 to 9,999) of a hospital data set are linked with the first 20,000 records (numbers 0 to 19,999) of an accident data set.
For the deduplication method, the following arguments need to be defined.
input_dataset
  A reference to the input data set, which must be initialised in read
  access mode.

tmp_dataset
  A reference to a temporary data set (initialised in access mode
  readwrite) that will hold the cleaned and standardised records before
  they are deduplicated.

rec_standardiser
  A reference to a record standardiser. If set to None (i.e.
  rec_standardiser = None) then the records from the input data set will
  be used directly in the deduplication process (they will be directly
  copied into the temporary data set; in such a case, the temporary data
  set must have the same field name definitions as the input data set).
  See Chapter 6 and Section 6.9 for more details on how to initialise a
  record standardiser.

rec_comparator
  A reference to the record comparator to be used for comparing record
  pairs.

blocking_index
  A reference to the (blocking) index to be used.

classifier
  A reference to the classifier to be used to classify the compared
  record pairs (a Fellegi and Sunter classifier in the examples above).

first_record
  The number of the first record to be processed. If set to None, then it
  will be set to the first record number (i.e. the record with number 0).

number_records
  The number of records to be processed. If set to None, it will be set
  to the total number of records in the data set.

weight_vector_file
  If set to a file name, the weight vectors (together with the record
  identifiers defined by weight_vector_rec_field or
  weight_vector_rec_fields) will be saved into a CSV (comma separated
  values) text file. An existing file with the given name will be erased
  first. If set to None, no weight vector file will be written. A header
  line will be written with the column names being the names of the field
  comparison functions (see Section 9.2 for more details).

weight_vector_rec_field / weight_vector_rec_fields
  Can be None (the default), in which case the first two columns in the
  weight vector file (if defined) will be the (internal) record numbers
  for the two records being compared (resulting in a weight vector).
  Alternatively, this argument can be set to the name of a field in the
  input data set (for a deduplication project), or to a list with two
  field names - one from each of the two input data sets (for a linkage
  project). In these cases the first two columns in the weight vector
  file will contain the corresponding field values of the records being
  compared. A short sketch of reading such a weight vector file back is
  given after this argument list.
output_histogram
  If set to True (as in the examples above), a histogram of the computed
  matching weights will be output.

output_rec_pair_details
  Controls the output of details for the classified record pairs. As the
  examples above show, this argument can be set to a file name (the
  details will be written into that file) or to True.

output_rec_pair_weights
  If set to a file name (as in the examples above), the compared record
  pairs together with their matching weights will be written into that
  file.

output_threshold
  A threshold on the matching weight: only record pairs with a matching
  weight equal to or larger than this threshold will be included in the
  output (set to 30.0 and 40.0 in the examples above).

output_assignment
  If set to 'one2one' (as in the examples above), a one-to-one assignment
  procedure will be applied to the classified record pairs, so that each
  record is assigned to at most one other record.
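The following is a minimal sketch (not part of Febrl itself) of how the weight vector file written by the deduplication example above could be read back with Python's standard csv module. The file name and the two leading 'rec_id' columns follow that example call; the number and order of the weight columns depend on the field comparison functions defined in the project, so the column handling here is an assumption.

import csv

# Hypothetical post-processing of the weight vector file written by the
# deduplication example above ('dedup-example-weight-vecs.csv').
with open('dedup-example-weight-vecs.csv') as f:
    reader = csv.reader(f)
    header = next(reader)  # column names (field comparison functions)
    for row in reader:
        rec_id_a, rec_id_b = row[0], row[1]    # 'rec_id' values of the compared records
        weights = [float(w) for w in row[2:]]  # one weight per field comparison
        total_weight = sum(weights)            # summed matching weight (assumption)
        print(rec_id_a, rec_id_b, total_weight)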
For a record linkage process, similar arguments are needed. The main differences are basically that references to two input and two temporary data sets must be given, then references to two record standardisers and two indexing definitions (one per data set), plus first record and number of records for two data sets.
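As an illustration only, the following sketch shows a linkage call in which both input data sets are assumed to be already cleaned and standardised, so both record standardisers are set to None (assuming the None behaviour described above for the deduplication method carries over to linkage). All object names (clean_data_a, index_a, etc.) are hypothetical placeholders, not objects defined in the examples above.

myproject.link(input_dataset_a = clean_data_a,      # assumed already standardised
               tmp_dataset_a = tmpdata_a,
               input_dataset_b = clean_data_b,      # assumed already standardised
               tmp_dataset_b = tmpdata_b,
               rec_standardiser_a = None,           # copy records directly into tmpdata_a
               rec_standardiser_b = None,           # copy records directly into tmpdata_b
               rec_comparator = clean_comparator,
               blocking_index_a = index_a,
               blocking_index_b = index_b,
               classifier = fell_sunter_classifier,
               first_record_a = None,               # start at record number 0
               number_records_a = None,             # process all records in data set A
               first_record_b = None,
               number_records_b = None,
               weight_vector_file = None,           # do not write a weight vector file
               output_histogram = True,
               output_rec_pair_details = True,
               output_rec_pair_weights = 'clean-linkage-weights.csv',
               output_threshold = 40.0,
               output_assignment = 'one2one')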