6.9 Record Cleaning and Standardisation

Records are cleaned and standardised by defining components (such as names, addresses and dates), each made of one or more fields in the input data set, and by initialising a corresponding component standardiser for each component, as explained in the previous sections. The component standardisers produce the cleaned and standardised output fields. Examples of component and record standardisers were presented in Chapter 5 and in the previous sections. This section describes all available arguments for the record standardiser and gives an example (see code below). The record standardiser is implemented in the module standardisation.py.

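When a record standardiser is initialised, the following arguments must be given. The example code in this section passes four keyword arguments: name, input_dataset, output_dataset and component_standardisers.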

Once one or more component standardisers and one record standardiser are initialised, the standardisation process can be started by loading records from the input data set and passing them to the record standardiser, using either the method standardise (for one record) or the method standardise_block (for a list of records). The following example code shows how to initialise a record standardiser, assuming that the module standardisation.py has been imported and that the input and output data sets and several component standardisers have been initialised. Note that standardisation is normally done within the deduplication and linkage routines, as described in Section 9.6. Alternatively, if one is only interested in cleaning and standardising a data set, a standardisation process can be started as shown in Section 6.10.

# ====================================================================
# List of component standardisers initialised as in the previous sections
my_comp_standard = [name_rules_stand, address_hmm_stand, baby_std]

my_record_standardiser = RecordStandardiser(name = 'my-rec-stand',
                                   input_dataset = my_in_data,
                                  output_dataset = my_out_data,
                         component_standardisers = my_comp_standard)

# Load and standardise one record
one_record = my_in_data.read_record()

one_clean_record = my_record_standardiser.standardise(one_record)

# Load and standardise a block of records
records = my_in_data.read_records(0,10000)  # Load 10,000 records

clean_record_block = my_record_standardiser.standardise_block(records)
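
The relationship between the record standardiser and its component standardisers can be illustrated with a small stand-alone sketch. This is not the code in standardisation.py. The classes SketchComponentStandardiser and SketchRecordStandardiser, the field names, and the representation of a record as a dictionary are all assumptions made for illustration only, and the cleaning step (lower-casing and collapsing whitespace) is a placeholder for the rule-based or HMM-based standardisation described in the previous sections.

```python
# Illustrative sketch only: these classes are hypothetical stand-ins and
# are NOT the classes defined in the module standardisation.py.

class SketchComponentStandardiser:
    """Cleans one component built from one or more input fields."""

    def __init__(self, name, input_fields, output_field):
        self.name = name
        self.input_fields = input_fields  # Field names in the input record
        self.output_field = output_field  # Field name in the output record

    def standardise(self, record):
        # Join the component's input fields (missing fields count as empty),
        # then lower-case and collapse whitespace as a placeholder cleaning
        # step.
        raw = ' '.join(record.get(f, '') for f in self.input_fields)
        return {self.output_field: ' '.join(raw.lower().split())}


class SketchRecordStandardiser:
    """Applies every component standardiser to a record."""

    def __init__(self, name, component_standardisers):
        self.name = name
        self.component_standardisers = component_standardisers

    def standardise(self, record):
        # One record in, one cleaned record out
        clean_record = {}
        for comp in self.component_standardisers:
            clean_record.update(comp.standardise(record))
        return clean_record

    def standardise_block(self, records):
        # A list of records in, a list of cleaned records out
        return [self.standardise(rec) for rec in records]


name_std = SketchComponentStandardiser(
    'name', ['givenname', 'surname'], 'name_clean')
addr_std = SketchComponentStandardiser(
    'address', ['street', 'suburb'], 'address_clean')
rec_std = SketchRecordStandardiser('sketch-rec-stand', [name_std, addr_std])

# Made-up example records; the second one has no 'suburb' field
records = [
    {'givenname': ' Peter ', 'surname': 'MILLER',
     'street': '42  Main Rd', 'suburb': 'Canberra'},
    {'givenname': 'Ann', 'surname': 'Smith', 'street': '7 High St'},
]

one_clean_record = rec_std.standardise(records[0])
clean_record_block = rec_std.standardise_block(records)
```

In this sketch, standardise_block simply calls standardise for each record in the list. Whether the actual record standardiser processes a block in the same way is not stated here, so the sketch should be read only as an illustration of the two calling patterns.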