3. Data Cleaning and Record Linkage

Record linkage techniques are used to link data records relating to the same entities, such as patients or customers. Record linkage can be used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and effort in data acquisition for research studies.

If a unique entity identifier or key is available in all of the data sets to be linked, then the problem of linking at the entity level is trivial: a simple join operation in SQL, or its equivalent in other data management systems, is all that is required. However, if no unique key is shared by all of the data sets, then various record linkage techniques need to be used. As discussed in the previous chapter, these techniques can be broadly classified into deterministic or rules-based approaches, and probabilistic approaches.
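
As a minimal illustration of the first case, the following Python sketch links two small example data sets on a shared identifier; the data values and field names (pid, surname, diagnosis, postcode) are hypothetical and serve only to show that, with such a key, linkage reduces to an equality join.

    # Minimal sketch of linking two example data sets on a shared unique
    # key; all data values and field names here are hypothetical.
    hospital = [{'pid': 101, 'surname': 'Smith', 'diagnosis': 'asthma'},
                {'pid': 102, 'surname': 'Meier', 'diagnosis': 'diabetes'}]
    registry = [{'pid': 101, 'postcode': '2600'},
                {'pid': 103, 'postcode': '2913'}]

    # Equivalent to: SELECT * FROM hospital JOIN registry USING (pid)
    by_pid = {rec['pid']: rec for rec in registry}
    linked = [{**h, **by_pid[h['pid']]} for h in hospital if h['pid'] in by_pid]
    print(linked)   # only the record with pid 101 appears in both data sets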

No matter what technique is used, a number of issues need to be addressed when linking data. Often, data is recorded or captured in various formats, and data items may be missing or contain errors. A pre-processing phase that aims to clean and standardise the data is therefore an essential first step in every linkage process. Data sets may also contain duplicate entries, in which case linkage may need to be applied within a data set to de-duplicate it before linkage with other files is attempted.
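
The following sketch illustrates, under strong simplifying assumptions, cleaning followed by exact de-duplication on the cleaned values; the example records are invented, and real de-duplication would additionally rely on the approximate comparison and blocking techniques discussed below.

    import re

    # Simplified sketch of pre-processing and exact de-duplication on
    # cleaned field values. Real de-duplication would also apply the
    # approximate comparison and blocking techniques described later.

    def clean(value):
        """Lower-case, strip punctuation and collapse whitespace."""
        value = value.lower()
        value = re.sub(r'[^a-z0-9 ]', ' ', value)
        return ' '.join(value.split())

    def dedupe(records, fields):
        seen, unique = set(), []
        for rec in records:
            key = tuple(clean(rec[f]) for f in fields)
            if key not in seen:
                seen.add(key)
                unique.append(rec)
        return unique

    records = [{'name': 'Smith, John ', 'addr': '42 Main St.'},
               {'name': 'smith  john',  'addr': '42 main st'}]
    print(dedupe(records, ['name', 'addr']))   # the second record is dropped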

The process of linking records has various names in different user communities. While epidemiologists and statisticians speak of record linkage, computer scientists often refer to the same process as data matching or the object identity problem [17], and in the commercial processing of customer databases or mailing lists it is sometimes called merge/purge processing or list washing. Historically, the statistical and the computer science communities have developed their own techniques, and until recently few cross-references could be found. In this chapter we give an overview and try to identify similarities among the existing methods.

Computer-assisted record linkage goes back as far as the 1950s. At that time, most linkage projects were based on ad hoc heuristic methods. The basic ideas of probabilistic record linkage were introduced by Newcombe and Kennedy [26] in 1962, while the theoretical foundation was provided by Fellegi and Sunter [13] in 1969. Using frequency counts [39] to derive agreement and disagreement probabilities, each field comparison within a compared record pair is assigned a match weight, and critical values of the sum of these match weights are used to designate a pair of records as a link, a possible link or a non-link. Possible links are those pairs for which human oversight, also known as clerical review, is needed to decide their final linkage status. In theory, the person undertaking this clerical review has access to additional data (or may be able to seek it out) which enables them to resolve the linkage status. In practice, often no additional data is available and the clerical review process becomes one of applying human intuition or common sense to the decision based on the available data. One of the aims of the Febrl project is to automate (and possibly improve upon) this process through the use of machine learning and data mining techniques.
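
The following sketch illustrates this weighting scheme; the field names, the agreement (m) and disagreement (u) probabilities, and the two thresholds are illustrative assumptions only, not the values used by any particular system.

    from math import log2

    # Sketch of Fellegi-Sunter style match weights. The m and u
    # probabilities and the thresholds below are illustrative values;
    # in practice they are estimated, e.g. from frequency counts.

    m_prob = {'surname': 0.95, 'postcode': 0.90, 'birth_year': 0.98}
    u_prob = {'surname': 0.01, 'postcode': 0.05, 'birth_year': 0.02}

    def match_weight(rec_a, rec_b):
        total = 0.0
        for field in m_prob:
            m, u = m_prob[field], u_prob[field]
            if rec_a[field] == rec_b[field]:
                total += log2(m / u)               # agreement weight
            else:
                total += log2((1 - m) / (1 - u))   # disagreement weight
        return total

    def classify(weight, lower=0.0, upper=10.0):
        if weight >= upper:
            return 'link'
        if weight <= lower:
            return 'non-link'
        return 'possible link'   # would go to clerical review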

To reduce the number of comparisons (potentially, each record in one data set has to be compared with every record in the second data set), blocking techniques are typically used. The data sets are split into smaller blocks using blocking variables, such as the postcode or the Soundex [21] encoding of surnames. Only records within the same block are then compared.
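
The sketch below illustrates blocking for the linkage of two data sets on a single blocking variable (a postcode field is assumed purely as an example); in practice several blocking passes with different variables are usually combined.

    from collections import defaultdict

    # Sketch of blocking for linking two data sets: records are grouped
    # on a blocking variable (here an assumed 'postcode' field) and
    # candidate pairs are only formed between records of data set A and
    # data set B that share the same blocking key value.

    def candidate_pairs(data_a, data_b, blocking_key='postcode'):
        blocks_b = defaultdict(list)
        for rec in data_b:
            blocks_b[rec[blocking_key]].append(rec)
        for rec_a in data_a:
            for rec_b in blocks_b.get(rec_a[blocking_key], []):
                yield rec_a, rec_b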

To deal with typographical variations and data entry errors, approximate string comparison functions [30] are often used for names and addresses. These comparators usually return a score between 0.0 (two strings are completely different) and 1.0 (two strings are the same).
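
As a simple illustration, the sketch below uses a general-purpose sequence comparator from the Python standard library; specialised name comparators (such as the Jaro-Winkler comparator) behave similarly but are tuned to typical name variations. The example strings are invented.

    from difflib import SequenceMatcher

    # Illustration of an approximate string comparator using the Python
    # standard library; it returns a similarity score between 0.0 and 1.0.

    def similarity(s1, s2):
        """Return a score between 0.0 (completely different) and 1.0 (identical)."""
        return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()

    print(similarity('Meier', 'Meyer'))    # 0.8
    print(similarity('Smith', 'Jones'))    # 0.2 (low similarity)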

In recent years, researchers have been exploring the use of machine learning and data mining techniques [35] both to improve the linkage process and to allow linkage of larger data sets. For very large data sets, with hundreds of millions of records, special techniques have to be applied [38] to handle such volumes of data. Sorting large numbers of records becomes the main bottleneck, so sub-sets of possible links have to be extracted from a large unsorted data file [40] as a pre-processing step before the actual record comparisons can be carried out. The authors of [11,36] describe a hybrid system that in a first step uses unsupervised clustering on a small sample data set to create training data, which a classifier then uses in a second step to classify record pairs into links and non-links. The authors do not directly address the problem of data cleaning and standardisation; rather, they use approximate string comparison algorithms to deal with variations in strings. Clustering of large and high-dimensional data sets, with applications to matching, is discussed in [24]. The authors propose a two-step clustering algorithm that first uses cheap approximate distance metrics to form overlapping canopies, which are then clustered in a second step using traditional approaches. Another approach [25] learns field-specific string-edit distance weights and a binary classifier based on support vector machines (SVM) to find duplicate records in text databases.
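
As an illustration of the canopy idea only (not a reimplementation of the algorithm in [24]), the following sketch uses a deliberately crude distance measure and illustrative threshold values.

    import random

    # Sketch of canopy formation: a cheap distance measure and two
    # thresholds (loose > tight) produce overlapping groups that a more
    # expensive clustering or comparison step would then process.

    def cheap_distance(rec_a, rec_b):
        """Fraction of fields that disagree (a deliberately crude measure)."""
        fields = rec_a.keys()
        return sum(rec_a[f] != rec_b[f] for f in fields) / len(fields)

    def canopies(records, loose=0.6, tight=0.3):
        remaining = list(records)
        result = []
        while remaining:
            centre = remaining.pop(random.randrange(len(remaining)))
            canopy = [centre]
            still_remaining = []
            for rec in remaining:
                d = cheap_distance(centre, rec)
                if d < loose:
                    canopy.append(rec)           # inside the loose threshold
                if d >= tight:
                    still_remaining.append(rec)  # only tightly bound records are removed
            remaining = still_remaining
            result.append(canopy)
        return result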

The terms data cleaning (or data cleansing), data standardisation, data scrubbing, data pre-processing and ETL (extraction, transformation and loading) are used synonymously to refer to the general tasks of transforming the source data (often derived from operational, transactional information systems) into clean and consistent sets of records which are suitable for record linkage or for loading into a data warehouse [32]. The meaning of the term standardisation in this context is quite different from its use in epidemiology and statistics, where it usually refers to a method of dealing with the confounding effects of age. The main task of data standardisation in record linkage is the resolution of inconsistencies in the way information is represented or encoded in the data. Inconsistencies can arise through typographical or other data capture errors, the use of different code sets or abbreviations, and differences in record layouts.
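
A minimal sketch of such standardisation using small look-up tables is given below; the tables, field names and example values are illustrative assumptions, and real systems use far larger correction lists.

    # Sketch of simple field standardisation using look-up tables of
    # common abbreviations and code variants. The tables and example
    # values below are illustrative only.

    STREET_TYPES = {'st': 'street', 'rd': 'road', 'ave': 'avenue'}
    SEX_CODES = {'m': 'male', 'f': 'female', '1': 'male', '2': 'female'}

    def standardise_address(address):
        words = address.lower().replace('.', '').split()
        return ' '.join(STREET_TYPES.get(w, w) for w in words)

    def standardise_sex(code):
        return SEX_CODES.get(str(code).strip().lower(), 'unknown')

    print(standardise_address('42 Main St.'))   # '42 main street'
    print(standardise_sex('F'))                 # 'female'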

Fuzzy techniques and methods from information retrieval have been used to address the record linkage problem, with varying degrees of success. One approach is to represent text (or records) as document vectors and to compute the cosine distance [10] between such vectors. Another possibility is to use an SQL-like language [14] that allows approximate joins and the clustering of similar records, as well as decision functions that determine whether two records represent the same entity. Other methods [22] include statistical outlier identification, pattern matching, clustering and association-rule based approaches. Sorting data sets (to group similar records together) and comparing records within a sliding window [17] is a technique similar to the blocking applied by traditional record linkage approaches. The accuracy of the matching can be improved by using smaller window sizes and performing several passes over the data with different (often compound) keys, rather than a single pass with a large window size. This corresponds to applying several blocking strategies in a record linkage process.
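
The sketch below illustrates this sorted-neighbourhood technique; the window size and the compound sorting key are illustrative choices only.

    # Sketch of the sorted-neighbourhood technique: sort the records on a
    # (compound) key, then compare only records that fall within a sliding
    # window of size w. Several passes with different keys are usually
    # combined. The key function below is an illustrative example.

    def sorted_neighbourhood_pairs(records, key, w=5):
        ordered = sorted(records, key=key)
        for i in range(len(ordered)):
            for j in range(i + 1, min(i + w, len(ordered))):
                yield ordered[i], ordered[j]

    # Example compound key: first three letters of the surname plus postcode.
    # pairs = sorted_neighbourhood_pairs(
    #     records, key=lambda r: (r['surname'][:3].lower(), r['postcode']), w=7)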

Even though most approaches described in the computer science literature use approximate string comparison operators and external look-up tables to improve the matching quality, many do not consider the statistical theory of record linkage as developed by Fellegi and Sunter [13] and improved and extended by others. The Febrl system uses this approach as the basis of its record linkage engine, although deterministic and machine learning techniques may also be added at a later date.

Of course, the problem of finding similar entities is not limited to records that refer to persons. Increasingly important is the removal of duplicates from the results returned by Web search engines and automatic text indexing systems, where copies of documents have to be identified and filtered out before being presented to the user.