Techniques for Scalable Privacy-preserving Record Linkage

Tutorial held at CIKM 2013, Burlingame, 28 October 2013.


CIKM 2013 PPRL Tutorial slides (PDF, 2.6 MBytes)


Privacy-Preserving Record Linkage (PPRL) is an increasingly important topic in data management, data engineering, and data mining, as organizations in both the private and public sectors are under pressure to share, integrate, and link their data in order to allow analysis that is not possible on individual databases. At the same time, sensitive information such as personal identifying details or confidential business data need to be protected. PPRL can for example be applied to match health databases without revealing any sensitive personal details of patients, or to detect individuals that have been involved in fraudulent activities across organizations without the need to share the full, potentially confidential, databases. Research in PPRL over the past decade has developed a variety of algorithms, however the challenge of linking very large databases in privacy-preserving, scalable, accurate, and automatic ways is still an open problem. In this half-day tutorial we will illustrate the significance of PPRL through several real-world scenarios, and introduce the concepts, techniques, algorithms, and research directions of PPRL.


Assoc Prof Peter Christen
Research School of Computer Science, The Australian National University,
Canberra, ACT 0200, Australia

Assoc Prof Vassilios S. Verykios
School of Science and Technology, Hellenic Open University,
Patras, Greece

Ms Dinusha Vatsalan
Research School of Computer Science, The Australian National University,
Canberra, ACT 0200, Australia

Aims and Learning Objectives

By giving this tutorial we have several goals. First of all, our aim is to highlight the issues that privacy and confidentiality pose to knowledge and information management in general, but especially in applications where data need to be shared, integrated, and linked across organizations. As recent events have highlighted, we believe this topic is of high interest to researchers as well as to practitioners from both the private and public sectors who are faced with the challenging problem of how to share and integrate sensitive data from different sources without revealing private or confidential information.

Secondly, we consider privacy and confidentiality to be important aspects that need to be taken into consideration by an increasing number of people who work in data management and related areas of the IT sector, for example data curators whose task is to administer data for future use. This is deemed as very important and necessary given the increasing number of regulations originating from public agencies charged with the responsibility of harmonizing the availability and use thereof of personal and other kinds of sensitive data.

Finally, this tutorial is aimed at providing a broad overview of the challenges, key concepts, and the current state-of-the-art algorithms in privacy-preserving record linkage (PPRL). Participants of this tutorial will be able to appreciate the challenges of PPRL and the significance of research in this area, and will understand how the different approaches to PPRL work at a conceptual level.

Our specific learning objectives with this tutorial are to provide the audience with (1) an appreciation of the challenges of PPRL, (2) an overview of the current state-of-the-art of techniques that facilitate data to be shared, integrated, and linked in a privacy-preserving fashion, with a special focus on approaches that scale to very large databases, and (3) a characterization of the current gaps and resulting future directions in this research area.

Description of Topics

Our tutorial is structured into three distinct sections, each covering several topics.

  1. Section 1: The first topic in this section will provide the background of record linkage and data sharing/integration in general, covering the steps and challenges involved in the process of integrating and linking records across databases.

    In the second topic we will illustrate the need for techniques that protect the privacy of information during the linkage process with several real-world examples from diverse application areas, including public health research, business collaboration, and crime investigation.

    We conclude the first section with the topic of a taxonomy of PPRL algorithms and techniques. We will use this taxonomy in the reminder of the tutorial to characterize different PPRL approaches.

  2. Section 2: The first topic in the second section will cover the first generation of PPRL algorithms which allowed the exact matching of attribute values (based on simple hash-encoding algorithms) across organizations without revealing these values. Because in real-world situations data are often `dirty' and contain errors and variations, such exact privacy-preserving matching is inappropriate in many applications. However, these techniques will provide the audience with the basic building blocks required for more advanced techniques.

    We continue with the second topic in section two where we describe algorithms from the second generation of PPRL techniques that allow the secure approximate matching of values (for example via a secure edit-distance or secure Jaccard coefficient approach), but that have high computation and/or communication complexities that make these techniques non-scalable to large databases.

  3. Section 3: The topics of the third section focus on recent PPRL techniques that allow both approximate matching and facilitate linking of large databases (the third generation of PPRL techniques). The topics we will cover include PPRL techniques based on:
    • secure blocking using secure hashing (only comparing records which have certain tokens in common);
    • multi-dimensional embedding schemes (based on mapping attribute values into multi-dimensional spaces using variations of FastMap);
    • differential privacy (adding random noise to database queries following the requirements of differential privacy);
    • reference tables (using publicly available values to `hide' private values);
    • phonetic encoding (using Soundex, NYSIIS and similar techniques to securely group attribute values);
    • Bloom filters (hashing-based bit-arrays); and
    • pre-calculation of similarity values by the database owners.

    We then conclude our tutorial by summarizing the achievements of current techniques and highlighting the challenges that still need to be solved in order to make PPRL both accurate and practical in applications that require the secure matching of very large databases.

Target Audience

This tutorial is targeted both at researchers and practitioners who work in areas that either require the sharing and linking of data across organizations, or that work with sensitive private or confidential data that cannot be accessed, linked, and analyzed without any privacy protection. We believe the content of our tutorial will be of interest to CIKM participants from all three domains (databases, information retrieval, and knowledge management), as privacy and confidentiality increasingly become of relevance in all these domains.

For researchers and graduate students, our tutorial will provide a broad overview of the challenges and current approaches to PPRL, and highlight the directions for future research. For practitioners who deal with sensitive data, our tutorial will provide an overview of current techniques that enable PPRL, and their advantages and drawbacks. Most importantly, our tutorial will highlight that it is possible to link and share data in such ways that no private or confidential information (besides the matched records) is being revealed.

Prerequisite Knowledge of Audience

The content of this tutorial is at the level of a graduate course, such as for first year PhD students. The assumed background knowledge corresponds to introductory courses on data engineering, data mining, databases, algorithms, and data structures. Knowledge about basic cryptographic techniques is an advantage, but we will provide the necessary background information throughout the tutorial.