2. Introduction

Record linkage is a rapidly growing field with applications in many areas of health and biomedical research [1,15,19]. It is an initial step in many epidemiological studies and data mining projects, in order to assemble the required data in a form suitable for analysis. Data mining aims to analyse large and complex data sets to find patterns and rules, to detect outliers or to build predictive models of such data sets [16]. Often the data required for such analyses are contained in two or more separate databases, which do not share a common unique entity identifier (key). In such cases, record linkage techniques need to be used to join the data.

Methods used to tackle the record linkage problem fall into two broad categories: deterministic methods, in which sets of often very complex rules are used to classify pairs of records as links (i.e. relating to the same person or entity) or as non-links; and probabilistic methods, in which statistical models are used to classify record pairs. Probabilistic methods can be further subdivided into those based on classical probabilistic record linkage theory as developed by Fellegi and Sunter [13] in 1969, and newer approaches using maximum entropy and other machine learning techniques [10,11,24,35,36].
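To make the classical probabilistic approach more concrete, the following minimal Python 3 sketch computes a Fellegi and Sunter style matching weight for a pair of records and classifies the pair using two thresholds. It is an illustration only and is not taken from the Febrl code: the field names, the m- and u-probabilities, the threshold values and the function names match_weight and classify are all hypothetical, and fields are compared by exact equality only.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from Febrl):
#   m = P(field agrees | the two records refer to the same entity)
#   u = P(field agrees | the two records refer to different entities)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum the log2 agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a.get(field) == rec_b.get(field):
            # Agreement weight: positive when m > u
            total += math.log2(m / u)
        else:
            # Disagreement weight: negative when m > u
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Classify a record pair using two (assumed) weight thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    # Pairs between the thresholds are left for clerical review
    return "possible link"


rec_1 = {"surname": "smith", "given_name": "john", "postcode": "2600"}
rec_2 = {"surname": "smith", "given_name": "jon", "postcode": "2600"}

print(classify(match_weight(rec_1, rec_2)))
```

In this sketch a pair that agrees on surname and postcode but not on given name falls between the two thresholds and is reported as a possible link, which is the kind of pair that would be passed to clerical review. In practice the probabilities and thresholds have to be estimated from the data being linked rather than fixed by hand.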

Historical collections of administrative and other health data nowadays contain many tens or even hundreds of millions of records, with new data being added at the rate of millions of records per annum. Although computing power has increased tremendously in the last few decades, large-scale record linkage is still a slow and resource-intensive process. There have been relatively few advances over the last decade in the way in which probabilistic record linkage is undertaken, particularly with respect to the tedious clerical review process which is still needed to make decisions about pairs of records whose linkage status is doubtful. While computers have become much faster, the rate at which humans can undertake these clerical tasks has not increased.

Warning: Probabilistic record linkage is a powerful technique which can be used to assemble data sets which would otherwise not be available for health and biomedical research. However, there is the potential for the invasion of personal privacy whenever linkage between data sets is undertaken. It is therefore imperative that record linkage is performed in a strictly controlled and secure environment under one or more of the following conditions:

where informed consent has been given for the linkage to take place by all the individuals whose personal data is to be linked;
where a properly constituted institutional ethics committee has given permission for the linkage to take place because it considers that the public good which will result from the research substantially outweighs the public interest in the protection of privacy;
where legislation specifically permits or mandates the linkage of particular data files.

Users of the Febrl system should take time to familiarise themselves with all legislation, regulations, guidelines and procedures which relate to the protection of privacy and confidentiality, or which otherwise govern linkage between data collections in their jurisdiction. References to relevant Australian legislation and guidelines can be found on the Febrl project Web site at: http://datamining.anu.edu.au/linkage.html

The programs described in this manual, known collectively as Febrl ('Freely extensible biomedical record linkage'), are currently being developed as part of a collaborative project undertaken by the ANU Data Mining Group and the Centre for Epidemiology and Research in the New South Wales Department of Health. The aim of the project is to develop improved techniques for probabilistic record linkage which combine classical probabilistic methods with deterministic and, in particular, machine learning techniques in order to improve linkage quality and to reduce the time-consuming and tedious manual clerical review of possible links. Additionally, the project intends to make good use of modern high-performance parallel computing platforms, such as clusters of commodity PCs or workstations (which can be used as virtual parallel computers with some additional software installed), multiprocessor servers or supercomputers. We hope that the resulting software will allow biomedical and other researchers to link data sets of all sizes more efficiently and at reduced cost.

The Febrl program code and associated documentation and data files are published under the ANU Open Source License (see Appendix H), which is derived from the Mozilla Public License version 1.1 with minor changes to make it suitable for Australian law. The license permits the free use and redistribution of the Febrl manual (the document you are now reading) and the free use, modification and redistribution of the associated Febrl programs and data files, provided that any modifications or enhancements to the program code are made freely available to other users of the programs under the same licensing arrangements. You are strongly urged to read the license before you start using the programs. Please pay particular attention to the DISCLAIMER OF WARRANTY which appears in the license.

We hope that the release of the programs under an open source license will encourage other researchers to contribute to the ongoing development of the system, and to share the responsibility for its maintenance and support. At this stage, there are many areas of the system which need further work; some of these are listed in Appendix E.

Since its initial release (Version 0.1) the Febrl system has undergone a major redesign, resulting in an object-oriented approach which allows easier configuration and is more extensible. This fifth release (Version 0.3) contains, as its major new feature, a geocode matching system.

It is assumed that the reader has at least superficial familiarity with the syntax of the Python programming language in order to understand and customise the main project configuration module project.py, which, like the rest of the system, is written in Python. Later versions of the system may provide configuration tools which remove this requirement. Of course, knowledge of Python will be necessary if you wish to extend or customise the system. However, as well as being very powerful, Python is also extremely easy to learn, even for people with little or no prior programming experience. Python tutorials as well as implementations of the language itself can be found on the Python Web site at http://www.python.org. Python is a free, open source language which can be downloaded, installed and used on any number of computers for any purpose without charge. Versions of Python are available for all popular operating systems and types of computer.

The structure of this manual is as follows. Some ideas on the performance of Febrl (i.e. how long it takes to standardise and link or deduplicate certain numbers of records) are given in Section 2.1. The next chapter gives a short overview of the techniques and applications of record linkage and of data cleaning and standardisation in general. An overview of the Febrl system is given in Chapter 4, followed by a description of the central project.py module, which needs to be modified by a user to control the Febrl system and to run data cleaning and standardisation, as well as deduplication and linkage processes. The task of data cleaning and standardisation as implemented in Febrl is then presented in more detail in Chapter 6, including how to define and run a standardisation process. Name and address standardisation in Febrl is done using hidden Markov models (HMMs), and this technique is introduced in Chapter 7. HMMs are a powerful alternative to the often cumbersome rules-based approach to data standardisation. Chapter 8 then deals with the issue of HMM training. Instructions for the use of the programs tagdata.py and trainhmm.py are given in this chapter. Descriptions of the various components of the record linkage and deduplication processes are given in Chapter 9, including how to define field comparison functions, indexing techniques and matching classifiers. The related task of geocoding is the topic of Chapter 10. Several output forms are supported by Febrl, including a histogram, printing of record pairs, and saving results into text files. Chapter 11 presents these output forms in more detail. Also discussed in Chapter 11 are assignment restrictions, which can for example be applied to force one-to-one assignments for record pairs. Currently two auxiliary programs (randomselect.py, which allows random selection of input records, and generate.py, which is a data set generator able to create records and duplicates) are provided and described in Chapter 12.

Access to various data set formats is provided in Febrl, and this is the topic of Chapter 13. The various look-up tables and their corresponding file formats are described in Chapter 14. The Febrl system is provided with logging and verbose output capabilities, and Chapter 15 shows how to define and configure a project logger. Finally, the installation of the Febrl system is discussed in Chapter 16, and information on how to run Febrl on a parallel platform using MPI and Pypar is given in Chapter 17. Note that parallelism within Febrl is at an early stage, and we ask anyone interested in this area to contact the authors to exchange detailed information and experiences.

Appendix A lists all defined hidden Markov model states, and Appendix B contains the list of all supported tags used in the data standardisation process. A description of the rule-based name standardisation as implemented in Febrl is given in Appendix C. The manifest in Appendix D gives a list of all files contained in the current version of the Febrl distribution. A list of outstanding development tasks and planned additions and enhancements to the system appears in Appendix E. All files provided with the current Febrl version are listed in Appendix F, and support arrangements are discussed in Appendix G. Finally, a copy of the ANU Open Source License can be found in Appendix H.

Note: The authors recognise that some aspects of the Febrl project may have application in certain business and commercial settings. Such use is permitted by the ANU Open Source License under which Febrl is licensed. However, we wish to emphasise that the software is being developed purely with the needs of health and biomedical researchers in mind, and there are no plans to add features which business users might specifically need, such as Australia Post Address Matching Approval System (AMAS) processing and certification. See http://www.auspost.com.au/BCP/0,1080,CH2403%257EMO19,00.html for more information on AMAS and related technologies. It should also be remembered that the Febrl project is still in the early stages of its development, and the software cannot be considered to be of production quality.

We urge users with business or commercial data processing needs to examine the wide range of products and services available from commercial vendors. A non-comprehensive set of links to the Web sites of vendors of business-oriented data quality and data cleaning software services is available on the Febrl project Web site at http://datamining.anu.edu.au/linkage.html. The links are provided for information only and do not imply endorsement or recommendation of any particular vendor's products or services. Vendors of relevant products or services who would like a link to their Web site to be added to the Febrl project Web site should contact the authors by email.