PAKDD-2012 Tutorial on Privacy-Preserving Record Linkage

Tutorial for PAKDD-2012; Kuala Lumpur, 29 May-1 June 2012

Privacy-Preserving Record Linkage

Peter Christen,
Research School of Computer Science, The Australian National University,
Canberra, ACT 0200, Australia

Vassilios S. Verykios,
School of Science and Technology, Hellenic Open University,
Patras, Greece

Abstract

Privacy-Preserving Record Linkage (PPRL) is an increasingly important topic for data mining as organisations begin to share much more data requiring analysis, and privacy remains a significant concern. PPRL can for example be applied to match health databases without revealing any sensitive personal details of patients, or to detect individuals that have been involved in fraudulent activities across organisations without the need to share the full, potentially confidential, databases. Research in PPRL over the past decade has developed a variety of algorithms, however the challenge of matching very large databases in privacy-preserving, scalable, accurate and automatic ways is still an open problem. In this half-day tutorial we will illustrate the significance of PPRL through real-world scenarios and case-studies, and introduce the concepts, techniques, algorithms and research directions of PPRL.

Slides

Background

Over the past decade there has been an increased interest in the field of data mining, and in particular in algorithms and techniques that allow the sharing and analysis of large data collections in such ways that the privacy of the mined data, as well as the knowledge gained from them, is maintained [6].

Integrating and matching several databases is an important aspect in the initial stages of many data mining projects. The aim of this process is to identify which records in disparate databases refer to the same real-world entities. This allows improvements of data quality (for example by removing duplicate records that refer to the same entity) and facilitates data enrichment (by combining complementary information stored in different databases). This process is commonly known as record linkage, data matching, entity resolution or duplicate detection [8].

When record linkage is conducted between different organisations, then the privacy and confidentiality of the data required for the linking become a major concern. Often, it is not permissible to exchange identifying information (such as people's personal details) across different organisations, either because of legal regulations or because the data are of confidential nature.

Record linkage poses several major challenges [8]. The first is the lack of common unique entity identifiers, which means that the identification of common entities needs to be based on the available information that is shared among records, such as the names, addresses and dates of birth of individuals. The second challenge is poor data quality, which is tackled by the development of advanced similarity functions that try to distinguish different forms of the same information from actual different information. The third major challenge is the complexity of the linkage process when each record from one database is compared with all records from another database. This challenge is addressed by sophisticated searching and indexing techniques that reduce the large space of record comparisons through blocking, clustering or sorting of the databases.

The fourth major challenge of record linkage, privacy and confidentiality, is addressed by the research area of privacy-preserving record linkage (PPRL). PPRL aims to develop algorithms and techniques that can identify records that refer to the same real-world entities from two or more databases owned by different organisations, such that no private or confidential information is revealed. A challenging aspect of PPRL is that it needs to deal with the previously described three challenges as well.

The past six years have seen a steady increase in research and in the number of publications in PPRL, with a variety of novel techniques being proposed by different research communities [4,6,19]. We have been developing a taxonomy of PPRL algorithms and techniques, and we have identified three distinct generations of techniques (which we will present in this tutorial), each one building upon the previous to produce a holistic approach for solving the PPRL problem.

In this tutorial, we provide an introduction to the general record linkage process and its challenges, highlight the need for PPRL through a series of real world examples and case studies, and cover in detail the key concepts, algorithms and techniques that have been developed for PPRL in the past decade. We will then provide a creative classification of the techniques presented along with an evaluation with respect to different dimensions and metrics that we have identified. We conclude the tutorial with an outlook to further research required in this area.

Tutorial outline

We will organise the content of this tutorial in three sections:

The first section will provide the background of record linkage in general, covering the history of record linkage and the steps and challenges involved in the process of linking records across databases. We will illustrate the need for techniques that protect the privacy of information during the linkage process with several real-world examples and case studies. We conclude the first section with an overview of our taxonomy of PPRL algorithms and techniques, and with detailing the notation and metrics we will use in the reminder of the tutorial.
The second section will cover the first two generations of PPRL algorithms and techniques. We start with the first generation of basic hash-encoding algorithms that only allow the secure exact matching of values [1,7], making these techniques inappropriate in real world situations where data are often `dirty' and contain errors and variations. We then continue describing algorithms from the second generation of PPRL techniques, that allow the secure approximate matching of values (for example via a secure edit-distance or secure Jaccard coefficient approach), but that have high computation and/or communication complexities that make these techniques non-scalable to large databases [3,5,15].
The third section focuses on more recent techniques that allow both approximate matching and that take scalability into consideration (the third generation of PPRL techniques). We will present recently proposed techniques based on secure blocking [2], embedding schemes [9,16], differential privacy [10], reference tables [14], phonetic encodings [11,13], Bloom filters [12,17], and special indexing techniques [18]. We then conclude by summarising the achievements of current techniques and highlight the challenges that still need to be solved in order to make PPRL both accurate and practical in applications that require the secure matching of very large real-world databases.

Specific goals and objectives

By giving this tutorial we have several goals. First of all, our aim is to highlight the issues that privacy and confidentiality pose to data mining in general, but especially in applications where databases need to be matched and integrated across organisations. We believe this topic to be of interest to researchers as well as to practitioners from both the private and public sector who are faced with the challenging problem of how to integrate data from different sources.

Secondly, we consider privacy and confidentiality to be important aspects that need to be taken into consideration by an increasing number of people who work in data mining and related areas of the IT sector, especially by data curators whose main task is to administer data for future use. This is deemed as very important and necessary given the increasing number of regulations originating from public agencies charged with the responsibility of harmonising the availability and use thereof of personal and other kinds of sensitive data.

Finally, this tutorial is aimed at providing a broad overview of the challenges, key concepts, and the current state-of-the-art algorithms in PPRL. Participants of this tutorial will be able to appreciate the challenges of PPRL and the significance of research in this area, and will understand how the different approaches to PPRL work at a conceptual level.

Expected background of the audience

The content of this tutorial is at the level of a graduate course, such as for first year PhD students. The assumed background knowledge corresponds to introductory courses on data mining, databases, algorithms and data structures. Knowledge about basic cryptographic techniques is an advantage, but we will provide the necessary background information throughout the tutorial.

Bibliography

1: Agrawal R, Evfimievski A and Srikant, R: Information sharing across private databases. SIGMOD, San Diego, pp. 86-97, 2003.
2: Al-Lawati A, Lee D and McDaniel P: Blocking-aware private record linkage. IQIS, Baltimore, pp. 59-68, 2005.
3: Atallah MJ, Kerschbaum F and Du W: Secure and private sequence comparisons. WPES, Washington DC, pp. 39-44, 2003.
4: Christen P: Privacy-preserving data linkage and geocoding: Current approaches and research directions. Workshop on Privacy Aspects of Data Mining, held at IEEE ICDM, Hong Kong, 2006.
5: Churches T and Christen P: Some methods for blindfolded record linkage. BioMed Central Medical Informatics and Decision Making, 4(9), 2004.
6: Clifton C, Kantarcioglu M, Doan A, Schadow G, Vaidya J, Elmagarmid A and Suciu D: Privacy-preserving data integration and sharing. ACM SIGMOD workshop on Research Issues in Data Mining and Knowledge Discovery, Paris, pp. 19-26, 2004.
7: Dusserre L, Quantin C and Bouzelat H: A one way public key cryptosystem for the linkage of nominal files in epidemiological studies. Medinfo, 8:644-7, 1995.
8: Elmagarmid AK, Ipeirotis PG and Verykios VS: Duplicate record detection: A survey. IEEE TKDE 19(1), pp. 1-16, 2007.
9: Inan A, Kantarcioglu M, Bertino E and Scannapieco M: A hybrid approach to private record linkage. IEEE ICDE, Cancun, Mexico, pp. 496-505, 2008.
10: Inan A, Kantarcioglu M, Ghinita G and Bertino E: Private record matching using differential privacy. ACM EDBT, Lausanne, Switzerland, pp. 123-134, 2010.
11: Karakasidis A and Verykios VS: Privacy preserving record linkage using phonetic codes. Fourth Balkan Conference in Informatics, Thessaloniki, pp. 101-106, 2009.
12: Karakasidis A and Verykios VS: Secure Blocking + Secure Matching = Secure Record Linkage Journal of Computing Science and Engineering, 5(3), pp. 223-235, 2011.
13: Karakasidis A, Verykios VS and Christen P: Fake Injection Strategies for Private Phonetic Matching. International Workshop on Data Privacy Management, Leuven, Belgium, 2011.
14: Pang C, Gu L, Hansen D and Maeder A: Privacy-preserving fuzzy matching using a public reference table. Intelligent Patient Management, pp. 71-89, 2009.
15: Ravikumar P, Cohen WW and Fienberg SE: A secure protocol for computing string distance metrics. Privacy and Security Aspects of Data Mining workshop, held at IEEE ICDM, Brighton, UK, 2004.
16: Scannapieco M, Figotin I, Bertino E and Elmagarmid AK: Privacy preserving schema and data matching. SIGMOD, Beijing, pp. 653-664, 2007.
17: Schnell R, Bachteler T and Reiher J: Privacy-preserving record linkage using Bloom filters. BioMed Central Medical Informatics and Decision Making, 9(1), 2009.
18: Vatsalan D, Christen P and Verykios VS: An Efficient Two-Party Protocol for Approximate Matching in Private Record Linkage. Australasian Data Mining Conference (AusDM), Ballarat, 2011.
19: Verykios VS, Karakasidis A and Mitrogiannis VK: Privacy preserving record linkage approaches. International Journal of Data Mining, Modelling and Management, 1(2), pp. 206-221, 2009.
20: Yakout M, Atallah MJ and Elmagarmid AK: Efficient private record linkage. IEEE ICDE, pp. 1283-1286, Shanghai, 2009.

Presenter biographies

Peter Christen is a Senior Lecturer in the Research School of Computer Science at the Australian National University. He received his Diploma in Computer Science Engineering from ETH Zürich in 1995 and his PhD in Computer Science from the University of Basel in 1999 (both in Switzerland).

His research interests are in the areas of data mining and data matching (record linkage), with a special interest in privacy aspects of record linkage. He has published over 60 papers in these areas, and he is the principle developer of Febrl (Freely Extensible Biomedical Record Linkage), an open source data cleaning, deduplication and record linkage system.

Peter serves as a Senior Program Committee member for PAKDD'2012, and he is a Program Chair for the Australasian Data Mining conference in 2011 (AusDM'2011). He has also been a member of the Steering Committee for AusDM since 2006, and he is regularly invited to serve on the program committees of other data mining workshops and conferences. Peter has also served as reviewer for various journals, including IEEE Transactions on Knowledge and Data Engineering; ACM Journal of Data and Information Quality; Springer Data Mining and Knowledge Discovery and Statistics and Computing; and Elsevier Data and Knowledge Engineering, among others.

Homepage: http://cs.anu.edu.au/people/Peter.Christen
List of publications: http://cs.anu.edu.au/people/Peter.Christen/publications.php

Vassilios S. Verykios received the Diploma degree in Computer Engineering from the University of Patras in Greece in 1992, and the MS and the PhD degrees from Purdue University in 1997 and 1999, respectively. From 1999 to 2001 he was with the Faculty of Information Systems in the College of Information Science and Technology at Drexel University, Pennsylvania, as a tenure track Assistant Professor. From 2001 to 2005 he held various research positions at Intracom SA, SingularLogic SA and CTI and visiting Assistant Professor positions in the University of Thessaly, Hellenic Open University and the University of Patras. From 2005 to 2011 he was an Assistant Professor in the Department of Computer and Communication Engineering at the University of Thessaly in Volos, Greece. Since January of 2011, he is an Associate Professor in the School of Science and Technology, at the Hellenic Open University in Patras, Greece.

His main research interests include knowledge based systems, privacy, security and anonymity in advanced database systems, data mining, data reconciliation, privacy preserving record linkage, parallel computing and performance evaluation of large scale parallel systems. He has published over 70 papers in major referred journals and in the proceedings of international conferences and workshops. He has recently co-authored a monograph on `Association Rule Hiding for Data Mining' by Springer. Dr. Verykios has served on the program committees of several international scientific events and he has consulted for Telcordia Technologies, ChoiceMaker Technologies, Intracom SA and SingularLogic SA. He has also been a visiting researcher to CERIAS, the Department of Computer Sciences at Purdue University, the US Naval Research Laboratory and the Research and Academic Computer Technology Institute in Patras, Greece.

Homepage: http://www.eap.gr/view.php?artid=2268
List of publications: http://media.eap.gr/images/stories/pdf/cv_verykios_en.pdf