Before data cleaning and standardisation can be performed on a new data set using HMMs, as described in Chapter 7, two HMMs need to be trained: one for names and one for addresses. The training records should be taken from the data set which one wants to clean and standardise, or from a similar data set (or data sets).
Training data consists of comma-separated sequences of
tag:hmm_state pairs. Each sequence is
a training example that is given to the HMM, and the HMM learns
the characteristics of a data set by using all training examples that
it is given during training. Maximum likelihood estimates (MLEs) for
the matrix of transition and observation probabilities (see
Chapter 7) for an HMM are derived by
accumulating frequency counts of each type of transition and output
from the training examples. Because frequency-based MLEs are used, it
is important that the records in the training data set(s) are
reasonably representative of the overall data set(s) to be
standardised. The tagdata.py module (see
Section 8.1 below) provides various options to
automatically choose a random sub-set of records from a larger input
to act as training records. However, the HMMs are quite robust and are
not overly troubled if the records in the training data set(s) do not
represent an unbiased sample of records from the target data. For
example, it is possible to add training records which represent
unusual records without unduly degrading the performance of the
HMMs on more typical records. HMMs also degrade gracefully, in
that they still perform well even with records which are formatted in
a previously unencountered manner. A simple set of training
examples for the name component might look like this:
GF:gname1, GM:gname2, UN:sname1
GF:gname1, UN:gname2, SN:sname1
Each line in the example above corresponds to one training record, and contains a sequence that corresponds to a particular path through the various (hidden, unobserved) states of the HMM. Each sequence is made up of pairs containing an observation symbol (or tag in our case, the uppercase two-letter word), followed by a colon and an HMM state (the lowercase entity following the colon). These training examples are extracted from the original data set using the tagdata.py program, and the HMMs are created using the trainhmm.py program (see Section 8.2 below). The following is a basic step-by-step guide for hidden Markov model training within Febrl:
'#' at the beginning of the line). For correct tag sequences, add the appropriate HMM state for each tag of the sequence. Be sure to use lowercase for the state names, and to only use state names that are listed in Appendix A. Do not leave any spaces between the tag name (in uppercase), the separating colon and the state name (in lowercase). Do not remove the commas which separate each
tag:hmm_state pair. Only one training sequence (one line) should be activated (that is, not commented out) per input data record. We plan to provide a simple graphical user interface to make this editing task faster in a later version of Febrl.
absdiscount. Borkar et al.  suggest that absolute discounting seems to work best, but we have also had good results with Laplace smoothing.
hmm_file_name to the name of the initial HMM created in the previous step. In this way the initial HMM which you created using just 100 training records will be used to bootstrap the tagging of this second, larger training file. This reduces the amount of work which needs to be done by the person editing the training file, because it is much easier to correct existing states associated with each tag than it is to add states de novo, as was necessary in step 2 above.
hmm_file_name to the name of the second HMM file you just created, plus setting the
retag_file_name to the name of the second, corrected training file. Be sure to specify a different output file name for this third training file so that the second training file which you are reprocessing is not overwritten. Set the smoothing option as desired.
tag:hmm_state pairs has changed will be marked with
'***** Changed'. Examine only the changed records (since you have already verified the unchanged records in previous steps) and correct those which appear to be incorrect. Repeat the previous three steps until no more changes are detected in the training file. Note that you may also wish to edit the correction lists and look-up tables to improve the observation symbol tagging of the training data (as opposed to the hidden state assignment). You should re-tag the training file and retrain the HMM after making such modifications, to ensure that the changes have not caused other problems and that the HMM reflects the latest version of the tagging look-up tables.
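To make the frequency-count estimation described above concrete, the following sketch derives maximum likelihood transition and observation probabilities from comma-separated tag:hmm_state training sequences, with simple Laplace (add-one) smoothing applied during normalisation. This is only an illustrative sketch under assumed names (train_hmm is a made-up function), not the actual implementation in Febrl's trainhmm.py:

```python
from collections import defaultdict

def train_hmm(lines, smoothing_k=1.0):
    """Estimate HMM probabilities from 'TAG:state' training sequences by
    accumulating frequency counts (MLE), then normalising with Laplace
    (add-k) smoothing. Illustrative sketch only, not Febrl's trainhmm.py.
    """
    trans = defaultdict(lambda: defaultdict(float))  # state -> next state -> count
    emit = defaultdict(lambda: defaultdict(float))   # state -> tag -> count
    start = defaultdict(float)                       # initial state -> count
    states, tags = set(), set()

    # Parse each non-comment line into a sequence of (tag, state) pairs.
    seqs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        pairs = [p.strip().split(':') for p in line.split(',')]
        seqs.append([(tag, state) for tag, state in pairs])

    # Accumulate frequency counts of initial states, transitions and emissions.
    for seq in seqs:
        for tag, state in seq:
            states.add(state)
            tags.add(tag)
            emit[state][tag] += 1
        start[seq[0][1]] += 1
        for (_, state), (_, nxt) in zip(seq, seq[1:]):
            trans[state][nxt] += 1

    def normalise(counts, symbols):
        # Convert raw counts to smoothed probabilities summing to one.
        total = sum(counts.values()) + smoothing_k * len(symbols)
        return {s: (counts[s] + smoothing_k) / total for s in symbols}

    trans_p = {s: normalise(trans[s], states) for s in states}
    emit_p = {s: normalise(emit[s], tags) for s in states}
    start_p = normalise(start, states)
    return start_p, trans_p, emit_p
```

Applied to the two example training sequences shown earlier, this yields, for instance, a smoothed initial probability of 0.6 for the gname1 state (it starts both of the two sequences, smoothed over three observed states).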
This training cycle can be enhanced in various ways. For example, once
a reasonably good HMM training set has been developed, it can be
further improved by adding examples of unusual records to it. By
definition, such unusual records occur in the input data only
infrequently, and thus very large numbers of training records would
need to be examined if they were to be found manually. However, by
using the freqs_file_name option in tagdata.py it
is possible to obtain a listing of record formats in ascending order
of frequency in a particular input data set (that is, the least common
record formats are listed first). Typically quite large numbers of
records are specified for processing by tagdata.py when the
freqs_file_name option is set - 100,000 would not be
unusual. Of course, there is no prospect of being able to inspect all
100,000 records in the resulting training file, but records with
unusual patterns amongst those 100,000 records can be found at or near
the top of the output file specified with that
option. Corrected versions of the
tag:hmm_state pair sequences
for those unusual records can then be added to the smaller, 1,000
record training file as discussed above, and the HMM then retrained
using the trainhmm.py program.
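The idea behind the freqs_file_name listing can be sketched in a few lines: count how often each tag pattern occurs in the input, then sort the patterns in ascending order of frequency so that the least common (most unusual) record formats appear first. The fragment below is an illustration of that idea only, with made-up data and a hypothetical function name, not the actual output format of tagdata.py:

```python
from collections import Counter

def pattern_frequencies(tag_sequences):
    """Return (pattern, count) pairs in ascending order of frequency,
    so the least common record formats come first. Illustrative only.
    """
    counts = Counter(', '.join(tags) for tags in tag_sequences)
    return sorted(counts.items(), key=lambda item: item[1])

# Example: three records, one with an unusual tag pattern.
records = [
    ['GF', 'GM', 'SN'],
    ['GF', 'GM', 'SN'],
    ['TI', 'GF', 'SN'],  # unusual: starts with a title tag
]
```

Here pattern_frequencies(records) lists the rare 'TI, GF, SN' pattern first, which is exactly the property that lets unusual record formats be found at the top of a frequency listing rather than by manual inspection of every record.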