Before data cleaning and standardisation of a new data set can be performed using HMMs as described in Chapter 7, two HMMs - one for names and one for addresses - need to be trained, using training records either from the data set which one wants to clean and standardise or from a similar data set (or data sets).
Training data consists of comma-separated sequences of tag:hmm_state pairs. Each sequence is
a training example that is given to the HMM, and the HMM learns
the characteristics of a data set by using all training examples that
it is given during training. Maximum likelihood estimates (MLEs) for
the matrix of transition and observation probabilities (see
Chapter 7) for an HMM are derived by
accumulating frequency counts of each type of transition and output
from the training examples. Because frequency-based MLEs are used, it
is important that the records in the training data set(s) are
reasonably representative of the overall data set(s) to be
standardised. The tagdata.py module (see
Section 8.1 below) provides various options to
automatically choose a random sub-set of records from a larger input
to act as training records. However, the HMMs are quite robust and are
not overly troubled if the records in the training data set(s) do not
represent an unbiased sample of records from the target data. For
example, it is possible to add training records which represent
unusual records without unduly degrading the performance of the
HMMs on more typical records. HMMs also degrade gracefully, in
that they still perform well even with records which are formatted in
a previously unencountered manner. A simple set of five training
examples for the name component might look like this:
GF:gname1, SN:sname1
UN:gname1, SN:sname1
GF:gname1, GM:gname2, UN:sname1
GF:gname1, GM:sname1
GF:gname1, UN:gname2, SN:sname1
Each line in the example above corresponds to one training record, and contains a sequence that corresponds to a particular path through the various (hidden, unobserved) states of the HMM. Each sequence is made up of pairs, each consisting of an observation symbol (a tag in our case, the uppercase two-letter word) followed by a colon and an HMM state (the lowercase word after the colon). These training examples are extracted from the original data set using the tagdata.py program, and the HMMs are created using the trainhmm.py program (see Section 8.2 below).
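To make the frequency-based estimation described above concrete, the following minimal sketch derives transition and observation counts from the five example sequences and normalises them into MLEs. This is illustrative Python, not Febrl's actual code; the implicit 'start' state and all variable names are assumptions made for this sketch.

from collections import defaultdict

training = [
    'GF:gname1, SN:sname1',
    'UN:gname1, SN:sname1',
    'GF:gname1, GM:gname2, UN:sname1',
    'GF:gname1, GM:sname1',
    'GF:gname1, UN:gname2, SN:sname1',
]

trans = defaultdict(lambda: defaultdict(int))  # state -> next state -> count
obs = defaultdict(lambda: defaultdict(int))    # state -> tag -> count

for seq in training:
    prev = 'start'  # implicit start state, assumed for this sketch
    for pair in seq.split(', '):
        tag, state = pair.split(':')
        trans[prev][state] += 1
        obs[state][tag] += 1
        prev = state

# Normalise the frequency counts into maximum likelihood estimates.
trans_mle = {}
for state, nxt in trans.items():
    total = float(sum(nxt.values()))
    trans_mle[state] = dict((t, c / total) for t, c in nxt.items())

obs_mle = {}
for state, tags in obs.items():
    total = float(sum(tags.values()))
    obs_mle[state] = dict((t, c / total) for t, c in tags.items())

print(trans_mle['gname1'])  # {'sname1': 0.6, 'gname2': 0.4}
print(obs_mle['gname1'])    # {'GF': 0.8, 'UN': 0.2}

For instance, state gname1 is followed by sname1 in three of its five occurrences, so the estimated transition probability P(sname1 | gname1) is 3/5. The following is a basic step-by-step guide for hidden Markov model training within Febrl: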
1. Use the tagdata.py program to tag a small sample of records (say 100) from your data set, creating a first training file.

2. Edit this training file. Each input record is followed by its possible tag sequences, written as comments (with a '#' at the beginning of the line). For correct tag sequences, add the appropriate HMM state for each tag of the sequence. Be sure to use lowercase for the state names, and to only use state names that are listed in Appendix A. Do not leave any spaces between the tag name (in uppercase), the separating colon and the state name (in lowercase). Do not remove the commas which separate each tag:hmm_state pair. Only one training sequence (one line) should be activated (that is, not commented out) per input data record. We plan to provide a simple graphical user interface to make this editing task faster in a later version of Febrl.
3. Train an initial HMM on this corrected training file using the trainhmm.py program, setting hmm_smoothing to either None, laplace or absdiscount. Borkar et al. [4] suggest that absolute discounting seems to work best, but we have also had good results with Laplace smoothing (a sketch of both smoothing schemes is given after this list).
4. Tag a second, larger sample of records (say 1,000) with tagdata.py, this time setting hmm_file_name to the name of the initial HMM created in the previous step. In this way the initial HMM which you created using just 100 training records will be used to bootstrap the tagging of this second, larger training file. This reduces the amount of work which needs to be done by the person editing the training file, because it is much easier to correct existing states associated with each tag than it is to add states de novo, as was necessary in step 2 above. Correct this second training file as before.
5. Train a second HMM from the corrected training file, then run tagdata.py once more, setting hmm_file_name to the name of the second HMM file you just created, plus setting retag_file_name to the name of the second, corrected training file. Be sure to specify a different output file name for this third training file so that the second training file which you are reprocessing is not overwritten. Set the smoothing option as desired.
6. In this third training file, records for which the sequence of tag:hmm_state pairs has changed will be marked with '***** Changed' (a sketch of this change detection is given after the list). Examine only the changed records (since you have already verified the unchanged records in previous steps) and correct those which appear to be incorrect. Repeat the previous three steps until no more changes are detected in the training file. Note that you may also wish to edit the correction lists and look-up tables to improve the observation symbol tagging of the training data (as opposed to the hidden state assignment). You should re-tag the training file and retrain the HMM after making such modifications, to ensure that the changes have not caused other problems, and to ensure that the HMM reflects the latest version of the tagging look-up tables.
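To illustrate the smoothing options mentioned in step 3, here is a minimal sketch of Laplace smoothing and of one common formulation of absolute discounting, applied to a dictionary of frequency counts. This is illustrative Python, not Febrl's implementation; the function names, the default discount of 0.15 and the example counts are all assumptions.

def laplace(counts, vocab):
    # Laplace (add-one) smoothing: every symbol in the vocabulary
    # receives a pseudo-count of one, so unseen symbols get
    # non-zero probability mass.
    total = sum(counts.values())
    return dict((s, (counts.get(s, 0) + 1.0) / (total + len(vocab)))
                for s in vocab)

def absdiscount(counts, vocab, d=0.15):
    # Absolute discounting: subtract a fixed discount d from every
    # seen count and share the freed mass equally among the unseen
    # symbols (one common formulation; details vary).
    total = sum(counts.values())
    unseen = [s for s in vocab if counts.get(s, 0) == 0]
    seen = len(vocab) - len(unseen)
    probs = {}
    for s in vocab:
        c = counts.get(s, 0)
        if c > 0:
            probs[s] = (c - d) / total
        elif unseen:
            probs[s] = d * seen / (total * len(unseen))
    return probs

counts = {'GF': 4, 'UN': 1}  # a state that never emitted tag 'TI'
print(laplace(counts, ['GF', 'UN', 'TI']))      # TI gets 1/8
print(absdiscount(counts, ['GF', 'UN', 'TI']))  # TI gets 0.3/5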
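The '***** Changed' marker in step 6 reflects a comparison of the activated sequences in successive versions of the training file. The sketch below shows the idea; it is illustrative Python rather than Febrl code, and the file names are hypothetical.

def active_sequences(path):
    # Return the activated (non-comment, non-blank) training sequences.
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith('#')]

old = active_sequences('name-train-2.csv')  # hypothetical file names
new = active_sequences('name-train-3.csv')

for num, (before, after) in enumerate(zip(old, new), 1):
    if before != after:
        print('record %d ***** Changed' % num)
        print('  was: %s' % before)
        print('  now: %s' % after)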
This training cycle can be enhanced in various ways. For example, once
a reasonably good HMM training set has been developed, it can be
further improved by adding examples of unusual records to it. By
definition, such unusual records occur in the input data only
infrequently, and thus very large numbers of training records would
need to be examined if they were to be found manually. However, by
setting the freqs_file_name
option in tagdata.py, it
is possible to obtain a listing of record formats in ascending order
of frequency in a particular input data set (that is, the least common
record formats are listed first). Typically quite large numbers of
records are specified for processing by tagdata.py when the
freqs_file_name
option is set - 100,000 would not be
unusual. Of course, there is no prospect of being able to inspect all
100,000 records in the resulting training file, but records with
unusual patterns amongst those 100,000 records can be found at or near
the top of the output file specified with the freqs_file_name
option. Corrected versions of the tag:hmm_state
pair sequences
for those unusual records can then be added to the smaller, 1,000
record training file as discussed above, and the HMM then retrained
using the trainhmm.py program.
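The following sketch shows the idea behind such a frequency listing. It is illustrative Python rather than tagdata.py's actual implementation, and the input file name is hypothetical: it counts how often each tag pattern occurs in a tagged file and prints the rarest patterns first, which is where unusual record formats surface.

from collections import Counter

patterns = Counter()
with open('tagged-records.csv') as f:  # hypothetical tagged input
    for line in f:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Reduce each sequence to its tag pattern, ignoring the states.
        tags = [pair.split(':')[0] for pair in line.split(', ')]
        patterns[', '.join(tags)] += 1

# Least common record formats first, as in the freqs_file_name listing.
for pattern, freq in sorted(patterns.items(), key=lambda p: p[1]):
    print('%6d  %s' % (freq, pattern))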