" TYPE="image/x-icon">

Student research and implementation projects in Data Mining and Data Linkage/Matching in 2009

Supervisor for all projects is Peter Christen.

Last update: 14 July 2009.


Data mining projects

  • Why do students fail? Data mining the DCS student database
    (Research project, suitable for PhD and MPhil or Honours)

    The ANU Department of Computer Science (DCS) has been collecting student course enrolments, marks and grades through its Faculty Information System (FAIS) for several years.

    The aim of this research project is to develop data mining techniques that allow analysis of the FAIS database, with the objective of better understanding student performance, progress and retention. Questions that the DCS is interested in include, for example: Why do students stop studying computer science? What correlations are there between their marks in different courses? Can we predict that a student will have problems in a certain course based on her or his past performance? Can we identify students who might be at risk of failing or dropping out of computer science courses early in their studies?

    This project will include the following components:

    • Exploration, development and application of techniques that allow data mining of the multi-relational data that is available in FAIS. This will include using the open source data mining tool Rattle as well as development of specific tools and techniques required for this project (likely using the Python programming language).

    • Development of a data generator - based on real, summarised FAIS data - that will allow creation of synthetic data to be used for later testing and evaluation of the data mining techniques to be developed in this project.

    • Analysis of real, anonymised FAIS student data. This work will be done on a secure compute server and will require that an Ethics approval for this research has been obtained from the Human Ethics committee at the ANU Research Office.

    Note: Parts of this project depend upon the outcomes of the Human Ethics approval, and therefore there might be changes to this project.


  • Data mining of discussion forums
    (Research project, suitable for PhD and MPhil or Honours)

    Increasingly, people communicate with each other using electronic means such as e-mail, SMS, discussion forums, bulletin boards or chat rooms. So far, only limited research has been done on finding patterns in such online discussions. This project aims to investigate data mining techniques in order to find patterns in publicly available online discussion forums (for example, forums that discuss new movies, DVDs, games, music, or electronic products). The challenge with such data is that participants often use nicknames, and that typos and abbreviations are common, as are slang expressions and emoticons, like ;-) or ;-(.

    Questions that this research could address are: Who are the participants in an online discussion? What topics are they discussing? Can we find new trends being discussed? Who is starting new trends? How are the participants interacting? When, and for how long, are participants online?

    This research is fairly open and involves many challenges, including: extracting the participants and what they write; extracting the conversations in a discussion forum (there might be several discussions going on at the same time); or finding the topics of a discussion by using external data (e.g. by querying a search engine). The research therefore involves techniques such as entity resolution, link analysis, time series analysis, text mining, and information retrieval.


Data linkage/matching projects

Background

Many organisations today collect massive amounts of data in their daily businesses. Examples include credit card and insurance companies, the health sector (e.g. Medicare), taxation, statistics, police and intelligence, and telecommunications. Data mining techniques are used to analyse such large data sets, in order to find patterns and rules, or to detect outliers.

In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such data linkages (also called data matching) is to merge all records relating to the same entity, such as a customer or a patient.

Most of the time the linkage process is challenged by the lack of a common unique identifier, and thus becomes non-trivial. Probabilistic linkage techniques have to be applied, using personal identifiers such as names, addresses and dates of birth.
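As an illustration of this idea, here is a minimal Python sketch in the spirit of the classical Fellegi and Sunter approach to probabilistic linkage. The field names and all m- and u-probabilities below are invented for illustration; they are not taken from Febrl or from any real linkage study.

```python
# Minimal sketch of probabilistic record pair scoring: each compared field
# contributes a log-likelihood weight depending on whether its values agree.
import math

# m: P(field agrees | records are a true match)
# u: P(field agrees | records are not a match)
# All values below are purely illustrative.
FIELD_WEIGHTS = {
    "surname":    {"m": 0.95, "u": 0.01},
    "given_name": {"m": 0.90, "u": 0.05},
    "postcode":   {"m": 0.85, "u": 0.02},
}

def match_score(rec_a, rec_b):
    """Sum of agreement/disagreement log2 weights over the compared fields."""
    score = 0.0
    for field, p in FIELD_WEIGHTS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += math.log2(p["m"] / p["u"])              # agreement weight
        else:
            score += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return score

a = {"surname": "smith", "given_name": "john", "postcode": "2600"}
b = {"surname": "smith", "given_name": "jon",  "postcode": "2600"}
c = {"surname": "brown", "given_name": "mary", "postcode": "3000"}
```

Record pairs that agree on more identifying fields receive higher scores; pairs above an upper threshold are classified as matches, pairs below a lower threshold as non-matches.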

We have been working in data linkage since 2002. From 2003 to 2008 we were collaborating with the NSW Department of Health, Centre for Epidemiology and Research on the development of improved techniques and algorithms for large scale high-performance data linkage. We have been developing an open source data linkage system called Febrl (Freely extensible biomedical record linkage), which includes modules for data cleaning and standardisation (of names, addresses, dates of birth, etc.), deduplication, data linkage (or matching), and geocoding.

For more information please have a look at our publications as well as presentations that describe our research in more detail.

The following student projects address important issues within our data linkage/matching project.

  • Multi-core parallelisation of data linkage techniques
    (Research project, suitable for PhD and MPhil or Honours)

    One of the problems when linking large collections of data with tens or even hundreds of millions of records is the complexity of the linkage process. Potentially, each record in one data set has to be compared with every record in the second data set, resulting in a quadratic complexity, O(n²). Techniques known as blocking are normally used to reduce the number of record pair comparisons (for example, only records that have the same postcode value are compared with each other). However, linking large data sets is still a very time consuming process.
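The blocking idea described above can be sketched in a few lines of Python; the record fields here are purely illustrative:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, blocking_key):
    """Only records sharing a blocking key value are paired for comparison."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[blocking_key]].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))  # compare only within each block
    return pairs

records = [
    {"name": "peter", "postcode": "2600"},
    {"name": "petra", "postcode": "2600"},
    {"name": "paul",  "postcode": "3000"},
    {"name": "paula", "postcode": "3000"},
]

# Full pairwise comparison of these 4 records would require 6 pairs;
# blocking on postcode reduces this to 2 candidate pairs.
pairs = candidate_pairs(records, "postcode")
```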

    Only a very limited amount of research has so far been done in the area of parallel data linkage (we are aware of fewer than ten publications in this area, including one of our own papers: Febrl - A parallel open source data linkage system, PAKDD'04).

    With the increasing availability of multi-core CPUs in most modern desktop, laptop and server machines, new techniques that allow efficient parallelisation of data linkage techniques on such platforms need to be developed. The issues that have to be considered include:

    • data distribution and load balancing,
    • scalability, and
    • privacy and confidentiality (as people's names and addresses are normally used for the linkage).
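As a toy illustration of how blocks could be distributed over cores, the sketch below uses Python's standard multiprocessing module with a deliberately simplistic equality comparison on surnames. A real system would need the load balancing and privacy safeguards listed above, since block sizes are typically highly skewed.

```python
from multiprocessing import Pool
from itertools import combinations

def compare_block(block):
    """Return matching pairs inside one block (toy comparison: equal surname)."""
    return [(a, b) for a, b in combinations(block, 2)
            if a["surname"] == b["surname"]]

def parallel_link(blocks, workers=2):
    """Blocks are independent, so each can be compared by a separate worker."""
    with Pool(workers) as pool:
        results = pool.map(compare_block, blocks)  # one task per block
    return [pair for block_pairs in results for pair in block_pairs]

blocks = [
    [{"surname": "smith", "postcode": "2600"},
     {"surname": "smith", "postcode": "2600"}],
    [{"surname": "brown", "postcode": "3000"},
     {"surname": "jones", "postcode": "3000"}],
]
matches = parallel_link(blocks, workers=2)
```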

    A student working on this project would be involved through:

    • Participation in an exciting research project.
    • Exploration and development of parallel computing techniques, as well as data mining, machine learning, and statistical techniques.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).


  • Large-scale real-time data linkage
    (Research project, suitable for PhD and MPhil or Honours)

    Traditional data linkage approaches have assumed the matching of two static databases. However, in our networked and online world it is becoming increasingly important for many organisations to be able to find out if a query record is similar to a record that already exists in a large database, for example one containing customer or patient records. Data linkage thus becomes similar to querying a very large data collection using a search engine, returning a ranked list of matched records according to their similarity to the query record.

    This project is aimed at investigating approaches developed in information retrieval, such as inverted indexing techniques, and applying them to data linkage, with the ultimate goal of developing techniques that allow real-time linkage of query records with massive databases containing many millions of records.
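As a rough illustration of the inverted-index idea (not Febrl's actual implementation), the following sketch indexes record values by character q-grams and ranks candidates by the number of q-grams they share with a query value:

```python
from collections import defaultdict

def qgrams(value, q=2):
    """Set of overlapping character q-grams of a string."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

class InvertedIndex:
    """Each record value is indexed once; a query retrieves and ranks only
    the records sharing at least one q-gram, avoiding a full database scan."""

    def __init__(self):
        self.index = defaultdict(set)  # q-gram -> set of record identifiers
        self.records = {}

    def add(self, rec_id, value):
        self.records[rec_id] = value
        for gram in qgrams(value):
            self.index[gram].add(rec_id)

    def query(self, value):
        """Rank candidate records by the number of shared q-grams."""
        counts = defaultdict(int)
        for gram in qgrams(value):
            for rec_id in self.index[gram]:
                counts[rec_id] += 1
        return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

idx = InvertedIndex()
idx.add(1, "christen")
idx.add(2, "christensen")
idx.add(3, "smith")

# "smith" shares no bigram with the query, so it is never even considered.
ranking = idx.query("kristen")
```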

    We have recently published some initial work in this direction: Towards scalable real-time entity resolution using a similarity-aware inverted index approach, AusDM'08.

    A student working on this project would be involved through:

    • Participation in an exciting research project.
    • Exploration and development of information retrieval techniques, combined with data mining and machine learning methods.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).


  • Improving matching of bibliographic data
    (Research project, suitable for PhD and MPhil or Honours)

    This project has been taken by a student.

    The ANU Office of Research Management is currently involved in the matching of ANU publications data with commercial bibliographic databases such as Thomson ISI and Elsevier Scopus. The aim of these linkages is to obtain accurate citation counts for publications by ANU authors.

    The matching is challenged by data quality issues (different spelling of titles and journal names, abbreviations, only surnames and initials for authors, etc.), as well as the different naming conventions used by different disciplines. For example, the following is the article title of an ANU chemistry publication:

         Undecacarbonyl(methylcyclopentadienyl)-tetrahedro-triiridiummolybdenum,
         undecacarbonyl(tetramethylcyclopentadienyl)-tetrahedro-triiridiummolybdenum and
         undecacarbonyl(pentamethylcyclopentadienyl)-tetrahedro-triiridiummolybdenum
         
    This illustrates the challenges involved in matching bibliographic databases. The aim of this research project is to develop improved matching techniques specific for the matching of ANU publications.

    Specific issues to be researched are:

    • Development of domain specific approximate string comparison functions.
    • Application of collective entity resolution techniques for improved matching quality.
    • How can online digital libraries (such as Google Scholar) be used to improve data quality and data matching?
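To illustrate the first point above, the sketch below combines a standard edit-distance similarity with a hypothetical abbreviation-expansion step of the kind a domain-specific comparison function might use; the abbreviation table is invented for illustration:

```python
# Illustrative abbreviation table for bibliographic strings (hypothetical).
ABBREVIATIONS = {"j.": "journal", "proc.": "proceedings"}

def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance, row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def title_similarity(a, b):
    """Similarity in [0, 1] after expanding known abbreviations."""
    def expand(title):
        return " ".join(ABBREVIATIONS.get(w, w) for w in title.lower().split())
    a, b = expand(a), expand(b)
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

sim = title_similarity("J. Chem. Physics", "Journal Chem. Physics")
```

After abbreviation expansion the two title variants above become identical, which a plain string comparison would have missed.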

    A student working on this project would be involved through:

    • Participation in an exciting research project that has immediate practical impact for the ANU.
    • Exploration and development of string processing, data matching, data mining, and machine learning techniques.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development (see http://sourceforge.net/projects/febrl).


  • Citation tracker
    (Implementation project, appropriate courses would be COMP6703 (eSci), COMP8750 (CSys), COMP8770 (eSci), COMP8780 (HCI) and COMP3750 (CSys))
    Technical Difficulty Level: Moderate to high
    Conceptual Difficulty Level: Moderate

    Increasingly, researchers are using tools like Publish or Perish or Google Scholar to get up-to-date information about the impact of their publications, measured by citation numbers.

    While these tools provide the numbers of citations for an author's publications, they do not directly provide information about who is citing these publications. Additionally, an author currently needs to manually check if her or his publications have attracted new citations.

    The aim of this implementation project is to develop a system which keeps track of an author's publications, and is capable of informing the author (via RSS feed or e-mail notification) when new citations have been added. The system should also allow an author to obtain statistical information about her or his citations (for example, country distribution, or which journals and conferences the citations come from).

    A student working on this project would be involved through:

    • Participation in an exciting project that has immediate practical impact for many researchers world-wide.
    • Object-oriented cross-platform programming (using Python and friends).
    • Open source software development.
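The core notification logic of such a tracker could, for example, be sketched as a simple diff of stored against freshly fetched citation counts. The publication titles below are only examples, and fetching the counts themselves (e.g. from a bibliographic service) is left out:

```python
def new_citations(stored, fetched):
    """Return {publication: number of new citations} for increased counts."""
    return {pub: count - stored.get(pub, 0)
            for pub, count in fetched.items()
            if count > stored.get(pub, 0)}

# Previously stored counts versus freshly fetched counts (illustrative data).
stored  = {"Febrl - A parallel open source data linkage system": 12}
fetched = {"Febrl - A parallel open source data linkage system": 15,
           "Towards scalable real-time entity resolution": 3}

# Publications with new citations, ready to be sent out as RSS/e-mail alerts.
alerts = new_citations(stored, fetched)
```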


  • All-in-one Windows installer for Febrl open source data linkage system
    (Implementation project, appropriate courses would be COMP6703 (eSci), COMP8750 (CSys), COMP8770 (eSci), COMP8780 (HCI) and COMP3750 (CSys))
    Technical Difficulty Level: Moderate
    Conceptual Difficulty Level: Low to moderate

    The Febrl (Freely extensible biomedical record linkage) open source data linkage system developed by the ANU Data Mining group since 2002 is based on the Python language and requires various additional libraries and packages. This makes installing Febrl a non-trivial process for inexperienced users, especially on Windows-based systems.

    Specifically, the following packages and libraries are currently required in order to make the graphical user interface (GUI) for Febrl and all its functionality fully operational:

    • The PyGTK graphical user interface system.
    • The NumPy numerical library.
    • The Matplotlib plotting library.
    • The LIBSVM library for support vector machine (SVM) classification.

    For more details on the Febrl GUI and its functionality, please see the Febrl manual and its accompanying documentation.

    The aim of this implementation project is to investigate, develop and test an all-in-one installer for Windows that allows simple installation of all Febrl-related packages without requiring any technical knowledge from the user (similar to many typical Windows installers). Ideally, such an installer would be general enough to be usable by other open source projects that require various packages and libraries to be installed at the same time.

    A student working on this project would be involved through:

    • Participation in an exciting open source software development and research project.
    • Open source software development (see http://sourceforge.net/projects/febrl).
    • Investigation of available open source Windows installation systems.
    • Investigation of open source licensing.
    • Object-oriented cross-platform programming (using Python and friends).


  • Project load module for the Febrl graphical user interface (GUI)
    (Implementation project, appropriate courses would be COMP6703 (eSci), COMP8750 (CSys), COMP8770 (eSci) and COMP3750 (CSys))
    Technical Difficulty Level: High
    Conceptual Difficulty Level: Moderate

    A graphical user interface (GUI) was developed for Febrl last year; however, so far this GUI does not allow a saved Febrl project file to be loaded back into the GUI, limiting the GUI's functionality and practicality.

    The aim of this implementation project is to develop, implement and test a Python module that allows loading of an existing Febrl project into the Febrl GUI, and to integrate this module into the Febrl GUI. The Febrl GUI is written in the Python programming language and uses the PyGTK library.

    For more details on the Febrl GUI, please see the Febrl manual and its accompanying documentation.

    A student working on this project would be involved through:

    • Participation in an exciting open source software development and research project.
    • Open source software development (see http://sourceforge.net/projects/febrl).
    • Object-oriented cross-platform programming (using Python and friends).


  • Graphical user interface (GUI) for Febrl hidden Markov training module
    (Implementation project, appropriate courses would be COMP6703 (eSci), COMP8740 (AI), COMP8750 (CSys), COMP8760 (CSci), COMP8770 (eSci), COMP8780 (HCI) and COMP3750 (CSys))
    Technical Difficulty Level: Moderate to high
    Conceptual Difficulty Level: Moderate to high

    The Febrl (Freely extensible biomedical record linkage) open source data linkage system developed by the ANU Data Mining group includes a module that allows cleaning and standardisation of names and addresses using a probabilistic approach based on hidden Markov models (HMMs). This approach has been shown to produce better standardisation results than a commercial rule-based system.
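As a toy illustration of the HMM idea (not the actual Febrl training module), the sketch below tags the words of a raw name with output fields using the Viterbi algorithm. All states, transition and emission probabilities here are invented for illustration; in Febrl they are learned from training data.

```python
# Hidden states are the output fields; observed symbols are the input words.
STATES = ["title", "givenname", "surname"]
START = {"title": 0.4, "givenname": 0.5, "surname": 0.1}
TRANS = {
    "title":     {"title": 0.0, "givenname": 0.9, "surname": 0.1},
    "givenname": {"title": 0.0, "givenname": 0.1, "surname": 0.9},
    "surname":   {"title": 0.0, "givenname": 0.1, "surname": 0.9},
}

def emit(state, word):
    """Toy emission model: titles come from a closed word list."""
    if state == "title":
        return 0.9 if word in ("dr", "mr", "ms", "prof") else 0.001
    return 0.1  # any word is equally plausible as a given name or surname

def viterbi(words):
    """Most likely hidden state sequence for the observed words."""
    paths = {s: (START[s] * emit(s, words[0]), [s]) for s in STATES}
    for word in words[1:]:
        new = {}
        for s in STATES:
            # Best predecessor state, extending its path with state s.
            prob, best = max(
                (paths[p][0] * TRANS[p][s] * emit(s, word), paths[p][1])
                for p in STATES)
            new[s] = (prob, best + [s])
        paths = new
    return max(paths.values())[1]

tags = viterbi("dr peter christen".split())
```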

    A graphical user interface (GUI) was developed for Febrl last year; however, this GUI does not yet include an interface for the Febrl HMM training module, so a user currently needs to conduct this training outside of the GUI, using a text editor and terminal-based commands.

    For more details on the Febrl HMM standardisation approach and the Febrl GUI, please see our related papers.

    The Febrl GUI is written in the Python programming language and uses the PyGTK library. The aim of this student project is to extend the existing Febrl GUI with a graphical interface to the Febrl HMM training module.

    A student working on this project would be involved through:

    • Participation in an exciting open source software development and research project.
    • Open source software development (see http://sourceforge.net/projects/febrl).
    • Object-oriented cross-platform programming (using Python and friends).
    • GUI development using PyGTK.


  • Graphical user interface (GUI) for Febrl data generator
    (Implementation project, appropriate courses would be COMP6703 (eSci), COMP8740 (AI), COMP8750 (CSys), COMP8760 (CSci), COMP8770 (eSci), COMP8780 (HCI) and COMP3750 (CSys))
    Technical Difficulty Level: Moderate to high
    Conceptual Difficulty Level: Moderate to high

    The Febrl (Freely extensible biomedical record linkage) open source data linkage system developed by the ANU Data Mining group includes a data generator that allows creation of synthetic data sets that contain randomly generated names, addresses, telephone numbers, etc. This is an important part of data linkage research, as getting access to real-world data sets containing actual names and addresses is nearly impossible, because such data is protected by privacy and confidentiality regulations.
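The principle behind such a generator can be sketched as follows: create "original" records from small lookup tables, then derive "duplicates" by injecting random errors, so that the true match status of every generated pair is known. The lookup tables and error model below are purely illustrative, far simpler than Febrl's actual generator:

```python
import random

# Tiny illustrative lookup tables; real generators use large frequency tables.
GIVEN_NAMES = ["peter", "petra", "paul", "paula"]
SURNAMES = ["smith", "miller", "christen"]

def make_typo(value, rng):
    """Substitute one random character to simulate a data-entry error."""
    pos = rng.randrange(len(value))
    return value[:pos] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[pos + 1:]

def generate(n_originals, rng):
    """Create original records plus one corrupted duplicate for each."""
    records = []
    for i in range(n_originals):
        original = {"id": f"org-{i}",
                    "given": rng.choice(GIVEN_NAMES),
                    "surname": rng.choice(SURNAMES)}
        duplicate = dict(original, id=f"dup-{i}",
                         surname=make_typo(original["surname"], rng))
        records.extend([original, duplicate])
    return records

data = generate(3, random.Random(42))  # seeded for reproducibility
```

Because each duplicate's identifier records which original it came from, the generated data set provides the ground truth needed to evaluate linkage quality.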

    A graphical user interface (GUI) was developed for Febrl last year; however, this GUI does not yet include an interface for the Febrl data set generator.

    For more details on the Febrl data set generator and the Febrl GUI, please see our related papers.

    The Febrl GUI is written in the Python programming language and uses the PyGTK library. The aim of this student project is to extend the existing Febrl GUI with a graphical interface to the Febrl data set generator.

    A student working on this project would be involved through:

    • Participation in an exciting open source software development and research project.
    • Open source software development (see http://sourceforge.net/projects/febrl).
    • Object-oriented cross-platform programming (using Python and friends).
    • GUI development using PyGTK.


  • Online database querying for historical UK census data linkage
    (Implementation project, appropriate courses would be COMP6703 (eSci), COMP8750 (CSys), COMP8760 (CSci), COMP8770 (eSci) and COMP3750 (CSys))
    Technical Difficulty Level: Moderate
    Conceptual Difficulty Level: Moderate

    This project will be co-supervised by Dr Mac Boot from the Australian Demographic and Social Research Institute, ANU College of Arts and Social Sciences.

    The Australian Demographic and Social Research Institute at the ANU College of Arts and Social Sciences is planning to conduct a research project analysing historical census data from the United Kingdom originating from the nineteenth century.

    One task of this research involves enriching the available UK census data with detailed information about people's births, marriages and deaths. This information is available online from the FreeBMD database. We have received permission to conduct (limited) automated online querying of the FreeBMD database.

    The aim of this project is to analyse the FreeBMD interface, and to specify and implement a prototype that will allow automated querying of this database using a sample of UK census records.

    A student working on this project would be involved through:

    • Participation in an exciting research project.
    • Analysis of an online database and designing prototype software to query the database.
    • Object-oriented cross-platform programming (using Python and friends).

