6.1 Name and Address Cleaning and Standardisation

Febrl cleans and standardises the name and address components primarily with a supervised machine learning approach, implemented through a novel application of hidden Markov models (HMMs). A brief introduction to HMMs and their use for data standardisation is given in Chapter 7. Before a given data set can be standardised, the user needs to train HMMs using training data from the same or a similar data set. Two HMMs need to be trained, one for names and one for addresses. The process of creating training data is described in Chapter 8. Once trained HMMs are available for a given data set (or class of data sets), they can be reused, and the data cleaning and standardisation process becomes straightforward and efficient.

The data cleaning and standardisation process for the name and address components in Febrl consists of the three steps described in the following sections.

Febrl can also check for word spilling, i.e. cases where a word is cut off at the end of a field because of limited field length (for example in fixed-width input fields) and continues in the next field. Section 6.1.4 explains how word spilling is handled.
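
As a rough illustration of the word spilling problem, the sketch below joins a list of field values and drops the space between two fields whenever the last word of one field and the first word of the next together form a known word. This is only one possible approach under stated assumptions, not necessarily the method Febrl uses (see Section 6.1.4). The function join_fields and the set known_words are hypothetical names introduced for this example.

```python
def join_fields(fields, known_words):
    """Join field values into one string, repairing spilled words.

    If the last word built so far, concatenated with the first word of
    the next field, is in `known_words`, the two fields are joined
    without a space; otherwise a single space is inserted.
    """
    result = ""
    for field in fields:
        field = field.strip()
        if not field:
            continue  # skip empty fields
        if not result:
            result = field
            continue
        last_word = result.split()[-1]
        first_word = field.split()[0]
        if (last_word + first_word).lower() in known_words:
            result += field        # word spilled over: no space
        else:
            result += " " + field  # ordinary field boundary
    return result

# Hypothetical list of known words and fixed-width input fields.
known_words = {"street", "canberra", "peter"}
fields = ["42 miller stre", "et canb", "erra"]
print(join_fields(fields, known_words))  # prints "42 miller street canberra"
```

A check of this kind depends on the quality of the word list: a pair of unrelated words that happens to concatenate into a known word would be joined by mistake, so the list should be chosen with the data set in mind.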