6. Data Cleaning and Standardisation

The aim of the data cleaning and standardisation process is to transform the information stored in the original data into a well-defined and consistent form. Personal information may be recorded or captured in various formats and with different spellings; it might be outdated, and some items may be missing or contain errors. For example, spelling variations of names are common when data is captured over the telephone, and typing errors happen frequently when dates are entered. The data cleaning and standardisation steps attempt to deal with these problems. Converting the original input data into a well-defined form and segmenting it into many smaller output fields allows the linkage process to be much more accurate.
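The cleaning and segmenting idea can be sketched in a few lines of Python. The correction list, title set and output field names below are invented for this illustration and do not reflect Febrl's actual look-up tables or API:

```python
import re

# Hypothetical correction list mapping known variants to standard forms.
CORRECTIONS = {"doc": "doctor", "jnr": "junior"}
# Hypothetical set of tokens recognised as titles.
TITLES = {"doctor", "mister", "missus", "ms"}

def standardise_name(raw):
    """Clean a free-text name and segment it into output fields."""
    # 1. Normalise case and strip punctuation.
    cleaned = re.sub(r"[^a-z ]", " ", raw.lower())
    # 2. Replace known variants using the correction list.
    tokens = [CORRECTIONS.get(t, t) for t in cleaned.split()]
    # 3. Segment the token list into separate output fields.
    fields = {"title": "", "given_name": "", "surname": ""}
    if tokens and tokens[0] in TITLES:
        fields["title"] = tokens.pop(0)
    if tokens:
        fields["given_name"] = tokens[0]
        fields["surname"] = " ".join(tokens[1:])
    return fields
```

For example, `standardise_name("Doc. Peter MILLER")` yields three separate, cleaned fields instead of one inconsistent string, which is the form the later comparison steps operate on.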

Figure 6.1: Example of personal information standardisation.

As an example, the record in Figure 6.1 with four input components is cleaned and split into 14 output fields (the dark gray boxes). Comparing these output fields individually with the corresponding output fields of other records results in much better linkage quality than comparing, for example, the whole name or the whole address as a single string with the name or address from other records.
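A small sketch shows why field-wise comparison is more informative than whole-string comparison. The records and the simple agreement score below are illustrative only, not Febrl's actual comparison functions:

```python
def field_agreement(rec_a, rec_b):
    """Fraction of output fields on which two records agree exactly."""
    matches = sum(rec_a[f] == rec_b[f] for f in rec_a)
    return matches / len(rec_a)

# Two records that refer to the same person, with one name variation.
a = {"given_name": "peter", "surname": "miller", "postcode": "2617"}
b = {"given_name": "pete",  "surname": "miller", "postcode": "2617"}

# Comparing everything as one string finds no match at all.
whole_match = " ".join(a.values()) == " ".join(b.values())   # False

# Field-wise comparison still detects strong agreement (2 of 3 fields).
score = field_agreement(a, b)                                # 2/3
```

A single differing character makes the whole-string comparison fail completely, while the field-wise score degrades gracefully and can later be combined with approximate string comparators on the individual fields.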

Personal attributes (data items) used for record linkage can be broadly categorised into five classes: names, addresses, dates (such as date of birth) and times, categorical attributes (such as sex or country of birth), and scalar quantities (such as height or weight). The primary criterion for such attributes is that they are relatively invariant over time - they should not change, or at least not change often. For this reason, attributes such as diagnoses or procedures, or textual narratives of medical findings, are generally not used for record linkage. Similarly, scalar attributes are rarely used because they are subject to change, although this depends on the specific application. Currently Febrl provides specific facilities for the processing of names, addresses and dates. Later versions will provide facilities for transforming coded and uncoded categorical attributes into standard formats and values. In the meantime, the Python programming language in which Febrl is implemented can be used to write special-purpose data transformation and cleaning functions or routines. Due to its object-oriented approach, it is fairly easy to integrate custom data transformation procedures written by end users with other aspects of Febrl processing.
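As an example of such a user-written routine, the following sketch standardises a coded categorical attribute. The code values and the function are hypothetical, of the kind an end user might write and plug into a Febrl-based workflow, not part of Febrl itself:

```python
# Hypothetical mapping from assorted sex codes, as found in source
# data sets, to one standard value per category.
SEX_CODES = {"m": "male", "1": "male", "f": "female", "2": "female"}

def clean_sex(value):
    """Map a coded sex value to a standard form, or '' if unknown."""
    return SEX_CODES.get(value.strip().lower(), "")
```

Returning an empty string for unrecognised codes, rather than guessing, keeps missing or erroneous values distinguishable in the later comparison steps.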

Data cleaning and standardisation are implemented in the module standardisation.py, which uses various routines from the modules address.py, date.py, name.py and phonenum.py; these contain the functionality to clean and standardise the corresponding components. Different correction lists and tagging look-up tables are used for the cleaning and standardisation tasks.
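The role of a tagging look-up table can be sketched as follows. The tag names and table entries are invented for this illustration and are not Febrl's actual look-up table contents:

```python
# Illustrative tagging look-up table: each known token is assigned a
# tag (here WT = wayfare type, LN = locality name); unknown tokens
# receive the catch-all tag UN.
TAG_TABLE = {"street": "WT", "st": "WT", "road": "WT", "rd": "WT",
             "canberra": "LN", "sydney": "LN"}

def tag_tokens(tokens):
    """Attach a tag to each cleaned token of an input component."""
    return [(t, TAG_TABLE.get(t, "UN")) for t in tokens]
```

For instance, `tag_tokens(["42", "main", "st", "canberra"])` tags "st" as a wayfare type and "canberra" as a locality name; a subsequent segmentation step can then use these tag sequences to assign tokens to output fields.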

The following Section 6.1 discusses the underlying techniques used for cleaning and standardising names and addresses in more detail. A list of the possible output fields is given in Section 6.2, followed by descriptions in Sections 6.3, 6.4, 6.5 and 6.6 of how the component standardisers for names, addresses, dates and telephone numbers are initialised. Finally, Section 6.9 describes how a record standardiser can be initialised and how records can be cleaned and standardised.