Traditional data cleaning and standardisation programs have used various rule-based approaches to the task of parsing raw data. Typically, programmers have used regular expressions (as implemented in tools such as awk, grep or agrep), or other pattern-matching languages such as SNOBOL to search for particular signatures in the input data in order to work out how to segment it. However, pattern-matching languages in general and regular expressions in particular are not for the faint-of-heart.
The AutoStan program improved on simple (or far-from-simple) regular expressions by using an initial lexicon-based tokenisation phase followed by a re-entrant rule-based parsing and data re-writing phase. The Febrl system also uses lexicon-based tokenisation, but then uses a probabilistic approach based on hidden Markov models (HMMs) to assign each word in the input string to a particular output field.
HMMs were developed in the 1960s and 1970s and are widely used in speech and natural language processing. The use of HMMs in data standardisation was the topic of two recent research papers. Borkar et al. present a nested HMM approach for text segmentation (the task of segmenting an input string into well-defined output fields, so essentially the same as data standardisation) of Asian and American addresses and bibliographic records, and Seymore et al. discuss how the structure of HMMs can be learned from example data for the task of information extraction (e.g. extracting names, titles and keywords from publication abstracts).
An HMM is a probabilistic finite state machine made of a set of unobserved (hidden) states, transition edges between these states and a finite dictionary of discrete observation (output) symbols. Each edge is associated with a transition probability, and each state emits observation symbols from the dictionary with a certain probability distribution. Transition and observation probabilities are stored in two matrices, as shown in the two tables below. The sum of all transition probabilities out of a given state (one row in Table 7.1) has to equal 1.0, as does the sum of all observation probabilities for a particular state.
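As a sketch of this representation, the two matrices can be stored as nested dictionaries. The state and symbol names below follow the name HMM example used in this chapter; the probabilities quoted later in this section (e.g. 0.96 for observing 'TI' in the Title state) are used where available, while the remaining entries are invented placeholders chosen only so that each row sums to 1.0, and are not the actual entries of Tables 7.1 and 7.2:

```python
import math

# States and observation symbols of a small name HMM. The virtual Start
# and End states are not stored; they are handled via initial and final
# probabilities instead.
states = ["Title", "Givenname", "Middlename", "Surname"]
symbols = ["TI", "GF", "GM", "SN", "UN"]

# transition[a][b] = P(next state is b | current state is a).
# NOTE: entries not quoted in the text are illustrative placeholders.
transition = {
    "Title":      {"Givenname": 0.85, "Middlename": 0.05, "Surname": 0.10},
    "Givenname":  {"Givenname": 0.10, "Middlename": 0.25, "Surname": 0.65},
    "Middlename": {"Surname": 1.00},
    "Surname":    {"Givenname": 0.20, "Surname": 0.05, "End": 0.75},
}

# observation[s][o] = P(symbol o is emitted in state s).
observation = {
    "Title":      {"TI": 0.96, "UN": 0.04},
    "Givenname":  {"GF": 0.35, "GM": 0.35, "SN": 0.15, "UN": 0.15},
    "Middlename": {"GF": 0.33, "GM": 0.33, "SN": 0.17, "UN": 0.17},
    "Surname":    {"GF": 0.15, "GM": 0.15, "SN": 0.45, "UN": 0.25},
}

# Every row of both matrices must sum to 1.0.
for matrix in (transition, observation):
    for state, row in matrix.items():
        assert math.isclose(sum(row.values()), 1.0), state
```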
Figure 7.1 shows a simple HMM example for names with six states, including the Start and End states (which are both virtual states that are not actually stored in an HMM as no observation symbols are emitted in these states). A list of initial state probabilities is used instead of the Start state (i.e. probabilities that give the likelihood of a sequence starting in a certain state). In the given example HMM, there is a probability of 0.55 that a name starts with a Givenname and is followed with a (conditional) probability of 0.65 by a Surname, or a probability of 0.25 by a Middlename, and so on.
In the Febrl data standardisation system, instead of using the original words, numbers and other elements from the input records directly as observation symbols, each element (token) in the input record is tagged (as discussed in Chapter 6) using various look-up tables (lexicons) and some simple rules, and these tags are used as the HMM observation symbols. This is done to make the derived HMMs more general, i.e. to allow HMMs to be trained on one data set and then be used with other similar, but distinct, data sets with little or no loss of performance, while still taking advantage of readily available information such as lists of suburb and town (locality) names and postal codes. This differs from the approach taken by Borkar et al., in which words in the input data are classified only by their type (numeric, alphanumeric or purely alphabetic), not their value, before an HMM is fitted to the data. The result of Febrl's tagging step is a sequence of tags (one tag per input word or number), and these tag sequences are given to the HMM, which assigns them to its states (one tag per state). For a given tag sequence the most probable path through the HMM is determined using the Viterbi algorithm.
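The tagging step can be sketched as follows. This is a simplified illustration, not Febrl's actual code: each word is looked up in a set of tables and receives one tag per table it appears in (or 'UN' if it appears in none), and ambiguous words give rise to several candidate tag sequences. The table contents are tiny invented samples:

```python
# Illustrative look-up tables (real tables would hold thousands of entries).
titles       = {"doctor", "mister", "missus"}
givennames_m = {"peter", "paul"}
givennames_f = {"mary", "anne"}
surnames     = {"miller", "peter", "smith"}

def tag_word(word):
    """Return all tags a word qualifies for, or ['UN'] if unknown."""
    tags = []
    if word in titles:       tags.append("TI")
    if word in givennames_m: tags.append("GM")
    if word in givennames_f: tags.append("GF")
    if word in surnames:     tags.append("SN")
    return tags or ["UN"]

def tag_sequences(words):
    """Build every permutation of tag assignments for the input words."""
    seqs = [[]]
    for word in words:
        seqs = [s + [t] for s in seqs for t in tag_word(word)]
    return seqs

print(tag_sequences("doctor peter paul miller".split()))
# → [['TI', 'GM', 'GM', 'SN'], ['TI', 'SN', 'GM', 'SN']]
```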
Let's take an example to make this a bit clearer. Assume an input record contains the name component 'doctor peter paul miller', and assume the tags 'TI' (titles), 'GF' (female given names), 'GM' (male given names), 'SN' (surnames) and 'UN' (unknown words) are used, as in the name HMM example above. If the word 'doctor' is found in the title look-up table, it is assigned a 'TI' tag. Assuming the name 'peter' is found in both the male given name and the surname look-up tables, it is assigned the two tags 'GM' and 'SN'. The tag 'GM' is given to the name 'paul', assuming it is only found in the male given name look-up table, and 'miller' is assigned a 'SN' tag, assuming it is found in the surname look-up table. Because 'peter' was assigned two tags, the possible permutations of the input tags are the two tag sequences

['TI', 'GM', 'GM', 'SN']
['TI', 'SN', 'GM', 'SN']

The first tag sequence, ['TI', 'GM', 'GM', 'SN'], can be assigned to the following path through the HMM (with the corresponding observation symbols in brackets)

Start -> Title (TI) -> Givenname (GM) -> Middlename (GM) -> Surname (SN) -> End

The probability of this path is the product of the transition and observation probabilities along it,

0.30 * 0.96 * 0.85 * 0.35 * 0.25 * 0.33 * 1.00 * 0.45 * 0.75 = 0.0023856525

with 0.30 being the transition probability from the Start state into the Title state (the initial probability), 0.96 being the probability that the symbol 'TI' is observed in state Title, and so on. Another possible path through the HMM for the same tag sequence would be

Start -> Title (TI) -> Surname (GM) -> Givenname (GM) -> Surname (SN) -> End

with the much smaller probability

0.30 * 0.96 * 0.10 * 0.15 * 0.20 * 0.35 * 0.65 * 0.45 * 0.75 = 0.0000663390

For each tag sequence, the Viterbi algorithm determines the most probable path, and the sequence and path with the overall highest probability are selected; in this example this is the first path shown above. The states on this path then determine which output field each input word is assigned to: 'doctor' will become the title, 'peter' will become the given name, 'paul' the middle name and 'miller' the surname.
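To make the search over paths concrete, here is a minimal sketch of the Viterbi algorithm (not Febrl's actual implementation) working on nested probability dictionaries. Only the transition and observation probabilities quoted in the example above are filled in; entries not mentioned in the text are simply left out and treated as zero:

```python
def viterbi(tags, states, initial, final, transition, observation):
    """Return the most probable state path, and its probability, for a tag sequence."""
    # delta[s]: probability of the best path ending in state s after
    # consuming the tags seen so far; back[s]: that path itself.
    delta = {s: initial.get(s, 0.0) * observation[s].get(tags[0], 0.0)
             for s in states}
    back = {s: [s] for s in states}
    for tag in tags[1:]:
        new_delta, new_back = {}, {}
        for s in states:
            # Pick the best predecessor state for s.
            prev = max(states, key=lambda p: delta[p] * transition[p].get(s, 0.0))
            new_delta[s] = (delta[prev] * transition[prev].get(s, 0.0)
                            * observation[s].get(tag, 0.0))
            new_back[s] = back[prev] + [s]
        delta, back = new_delta, new_back
    # Include the transition into the virtual End state.
    best = max(states, key=lambda s: delta[s] * final.get(s, 0.0))
    return back[best], delta[best] * final.get(best, 0.0)

# Probabilities quoted in the example (all other entries assumed zero).
states = ["Title", "Givenname", "Middlename", "Surname"]
initial = {"Title": 0.30, "Givenname": 0.55}
final = {"Surname": 0.75}
transition = {
    "Title":      {"Givenname": 0.85, "Surname": 0.10},
    "Givenname":  {"Middlename": 0.25, "Surname": 0.65},
    "Middlename": {"Surname": 1.00},
    "Surname":    {"Givenname": 0.20},
}
observation = {
    "Title":      {"TI": 0.96},
    "Givenname":  {"GM": 0.35},
    "Middlename": {"GM": 0.33},
    "Surname":    {"GM": 0.15, "SN": 0.45},
}

path, prob = viterbi(["TI", "GM", "GM", "SN"], states, initial, final,
                     transition, observation)
print(path)  # ['Title', 'Givenname', 'Middlename', 'Surname']
print(prob)  # ≈ 0.0023856525
```

The returned path reproduces the first (most probable) path of the worked example.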
Both the transition and observation probabilities have to be trained using training data sets, i.e. collections of records taken from the same (or a similar) data set as the one to be standardised, which have been annotated manually. An HMM thus learns the characteristics of a data set. Results of experiments with both names and addresses are published in [7,8].
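In the simplest case such training amounts to counting. The following is a sketch of maximum-likelihood estimation from hand-annotated records, assumed for illustration only (the training data and state labels here are invented; Febrl's actual training process is described in Chapter 8). Initial and final probabilities would be estimated analogously from the first and last states of each sequence:

```python
from collections import Counter, defaultdict

# Each training record is a manually annotated list of (tag, state) pairs.
training = [
    [("TI", "Title"), ("GM", "Givenname"), ("SN", "Surname")],
    [("GM", "Givenname"), ("GM", "Middlename"), ("SN", "Surname")],
]

trans_counts = defaultdict(Counter)
obs_counts = defaultdict(Counter)
for record in training:
    state_seq = [state for _, state in record]
    for tag, state in record:
        obs_counts[state][tag] += 1          # count emissions
    for a, b in zip(state_seq, state_seq[1:]):
        trans_counts[a][b] += 1              # count transitions

def normalise(counter):
    """Turn raw counts into a probability distribution."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

transition = {s: normalise(c) for s, c in trans_counts.items()}
observation = {s: normalise(c) for s, c in obs_counts.items()}

print(transition["Givenname"])   # {'Surname': 0.5, 'Middlename': 0.5}
print(observation["Surname"])    # {'SN': 1.0}
```

In practice some smoothing is needed so that tags and transitions never seen in the training data do not get a probability of exactly zero.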
Thus, with this HMM-based standardisation method, there is no need for highly trained programming staff to maintain a large number of complex rules, written in arcane regular expression syntax, to handle the myriad of special cases and exceptions that occur in real-life data. Instead, the data needed to train the HMMs used by Febrl can easily be created by clerical staff within a couple of days. Furthermore, this training process can be accelerated by bootstrapping it with training data sets derived from other, similar data sources. The process of HMM training for the Febrl system is explained in more detail in Chapter 8.