Research School of Computer Science, The Australian National University,
Canberra, ACT 0200, Australia
Vassilios S. Verykios,
School of Science and Technology, Hellenic Open University,
Privacy-Preserving Record Linkage (PPRL) is an increasingly important topic for data mining as organisations begin to share much more data requiring analysis, and privacy remains a significant concern. PPRL can for example be applied to match health databases without revealing any sensitive personal details of patients, or to detect individuals that have been involved in fraudulent activities across organisations without the need to share the full, potentially confidential, databases. Research in PPRL over the past decade has developed a variety of algorithms, however the challenge of matching very large databases in privacy-preserving, scalable, accurate and automatic ways is still an open problem. In this half-day tutorial we will illustrate the significance of PPRL through real-world scenarios and case-studies, and introduce the concepts, techniques, algorithms and research directions of PPRL.
Over the past decade there has been an increased interest in the field of data mining, and in particular in algorithms and techniques that allow the sharing and analysis of large data collections in such ways that the privacy of the mined data, as well as the knowledge gained from them, is maintained .
Integrating and matching several databases is an important aspect in the initial stages of many data mining projects. The aim of this process is to identify which records in disparate databases refer to the same real-world entities. This allows improvements of data quality (for example by removing duplicate records that refer to the same entity) and facilitates data enrichment (by combining complementary information stored in different databases). This process is commonly known as record linkage, data matching, entity resolution or duplicate detection .
When record linkage is conducted between different organisations, then the privacy and confidentiality of the data required for the linking become a major concern. Often, it is not permissible to exchange identifying information (such as people's personal details) across different organisations, either because of legal regulations or because the data are of confidential nature.
Record linkage poses several major challenges . The first is the lack of common unique entity identifiers, which means that the identification of common entities needs to be based on the available information that is shared among records, such as the names, addresses and dates of birth of individuals. The second challenge is poor data quality, which is tackled by the development of advanced similarity functions that try to distinguish different forms of the same information from actual different information. The third major challenge is the complexity of the linkage process when each record from one database is compared with all records from another database. This challenge is addressed by sophisticated searching and indexing techniques that reduce the large space of record comparisons through blocking, clustering or sorting of the databases.
The fourth major challenge of record linkage, privacy and confidentiality, is addressed by the research area of privacy-preserving record linkage (PPRL). PPRL aims to develop algorithms and techniques that can identify records that refer to the same real-world entities from two or more databases owned by different organisations, such that no private or confidential information is revealed. A challenging aspect of PPRL is that it needs to deal with the previously described three challenges as well.
The past six years have seen a steady increase in research and in the number of publications in PPRL, with a variety of novel techniques being proposed by different research communities [4,6,19]. We have been developing a taxonomy of PPRL algorithms and techniques, and we have identified three distinct generations of techniques (which we will present in this tutorial), each one building upon the previous to produce a holistic approach for solving the PPRL problem.
In this tutorial, we provide an introduction to the general record linkage process and its challenges, highlight the need for PPRL through a series of real world examples and case studies, and cover in detail the key concepts, algorithms and techniques that have been developed for PPRL in the past decade. We will then provide a creative classification of the techniques presented along with an evaluation with respect to different dimensions and metrics that we have identified. We conclude the tutorial with an outlook to further research required in this area.
We will organise the content of this tutorial in three sections:
By giving this tutorial we have several goals. First of all, our aim is to highlight the issues that privacy and confidentiality pose to data mining in general, but especially in applications where databases need to be matched and integrated across organisations. We believe this topic to be of interest to researchers as well as to practitioners from both the private and public sector who are faced with the challenging problem of how to integrate data from different sources.
Secondly, we consider privacy and confidentiality to be important aspects that need to be taken into consideration by an increasing number of people who work in data mining and related areas of the IT sector, especially by data curators whose main task is to administer data for future use. This is deemed as very important and necessary given the increasing number of regulations originating from public agencies charged with the responsibility of harmonising the availability and use thereof of personal and other kinds of sensitive data.
Finally, this tutorial is aimed at providing a broad overview of the challenges, key concepts, and the current state-of-the-art algorithms in PPRL. Participants of this tutorial will be able to appreciate the challenges of PPRL and the significance of research in this area, and will understand how the different approaches to PPRL work at a conceptual level.
The content of this tutorial is at the level of a graduate course, such as for first year PhD students. The assumed background knowledge corresponds to introductory courses on data mining, databases, algorithms and data structures. Knowledge about basic cryptographic techniques is an advantage, but we will provide the necessary background information throughout the tutorial.
Peter Christen is a Senior Lecturer in the Research School of Computer Science at the Australian National University. He received his Diploma in Computer Science Engineering from ETH Zürich in 1995 and his PhD in Computer Science from the University of Basel in 1999 (both in Switzerland).
His research interests are in the areas of data mining and data matching (record linkage), with a special interest in privacy aspects of record linkage. He has published over 60 papers in these areas, and he is the principle developer of Febrl (Freely Extensible Biomedical Record Linkage), an open source data cleaning, deduplication and record linkage system.
Peter serves as a Senior Program Committee member for PAKDD'2012, and he is a Program Chair for the Australasian Data Mining conference in 2011 (AusDM'2011). He has also been a member of the Steering Committee for AusDM since 2006, and he is regularly invited to serve on the program committees of other data mining workshops and conferences. Peter has also served as reviewer for various journals, including IEEE Transactions on Knowledge and Data Engineering; ACM Journal of Data and Information Quality; Springer Data Mining and Knowledge Discovery and Statistics and Computing; and Elsevier Data and Knowledge Engineering, among others.
List of publications: http://cs.anu.edu.au/people/Peter.Christen/publications.php
Vassilios S. Verykios received the Diploma degree in Computer Engineering from the University of Patras in Greece in 1992, and the MS and the PhD degrees from Purdue University in 1997 and 1999, respectively. From 1999 to 2001 he was with the Faculty of Information Systems in the College of Information Science and Technology at Drexel University, Pennsylvania, as a tenure track Assistant Professor. From 2001 to 2005 he held various research positions at Intracom SA, SingularLogic SA and CTI and visiting Assistant Professor positions in the University of Thessaly, Hellenic Open University and the University of Patras. From 2005 to 2011 he was an Assistant Professor in the Department of Computer and Communication Engineering at the University of Thessaly in Volos, Greece. Since January of 2011, he is an Associate Professor in the School of Science and Technology, at the Hellenic Open University in Patras, Greece.
His main research interests include knowledge based systems, privacy, security and anonymity in advanced database systems, data mining, data reconciliation, privacy preserving record linkage, parallel computing and performance evaluation of large scale parallel systems. He has published over 70 papers in major referred journals and in the proceedings of international conferences and workshops. He has recently co-authored a monograph on `Association Rule Hiding for Data Mining' by Springer. Dr. Verykios has served on the program committees of several international scientific events and he has consulted for Telcordia Technologies, ChoiceMaker Technologies, Intracom SA and SingularLogic SA. He has also been a visiting researcher to CERIAS, the Department of Computer Sciences at Purdue University, the US Naval Research Laboratory and the Research and Academic Computer Technology Institute in Patras, Greece.
List of publications: http://media.eap.gr/images/stories/pdf/cv_verykios_en.pdf