8. Hidden Markov Model Training

Note: A clearer, more detailed guide to the hidden Markov model (HMM) training process will appear in this chapter in future versions of this manual. However, at this stage the authors are themselves still working out the best and most efficient procedures for training HMMs in order to get the best possible results with the minimum of effort.

Before data cleaning and standardisation can be performed using HMMs as described in Chapter 7 on a new data set, two HMMs - one for names and one for addresses - need to be trained using training records from the same data set which one wants to clean and standardise, or from a similar data set (or data sets).

Training data consists of comma-separated sequences of tag:hmm_state pairs. Each sequence is a training example that is given to the HMM, and the HMM learns the characteristics of a data set from all the training examples it is given during training. Maximum likelihood estimates (MLEs) for the matrices of transition and observation probabilities (see Chapter 7) of an HMM are derived by accumulating frequency counts of each type of transition and output from the training examples. Because frequency-based MLEs are used, it is important that the records in the training data set(s) are reasonably representative of the overall data set(s) to be standardised. The tagdata.py module (see Section 8.1 below) provides various options to automatically choose a random sub-set of records from a larger input to act as training records.

However, the HMMs are quite robust and are not overly troubled if the records in the training data set(s) do not represent an unbiased sample of records from the target data. For example, it is possible to add training records which represent unusual records without unduly degrading the performance of the HMMs on more typical records. HMMs also degrade gracefully, in that they still perform well even with records which are formatted in a previously unencountered manner.

A simple set of five training examples for the name component might look like:

GF:gname1, SN:sname1
UN:gname1, SN:sname1
GF:gname1, GM:gname2, UN:sname1
GF:gname1, GM:sname1
GF:gname1, UN:gname2, SN:sname1
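
To make the frequency-count idea concrete, the following minimal Python sketch parses the five example sequences above and derives unsmoothed maximum likelihood estimates by normalising the counts. It is an illustration only and is not taken from trainhmm.py: the function names and the `<start>` and `<end>` pseudo-states are assumptions introduced here, and Febrl may represent initial and final probabilities differently.

```python
from collections import Counter, defaultdict

# The five example training sequences shown above.
TRAINING_LINES = [
    "GF:gname1, SN:sname1",
    "UN:gname1, SN:sname1",
    "GF:gname1, GM:gname2, UN:sname1",
    "GF:gname1, GM:sname1",
    "GF:gname1, UN:gname2, SN:sname1",
]


def parse_sequence(line):
    """Split one training line into a list of (tag, state) tuples."""
    pairs = []
    for item in line.split(","):
        tag, state = item.strip().split(":")
        pairs.append((tag, state))
    return pairs


def normalise(counts):
    """Turn nested counts into relative frequencies (each row sums to 1)."""
    probs = {}
    for source, row in counts.items():
        total = sum(row.values())
        probs[source] = {target: n / total for target, n in row.items()}
    return probs


def estimate_probabilities(lines):
    """Return (transition, observation) MLEs from frequency counts.

    '<start>' and '<end>' are hypothetical pseudo-states used only in
    this sketch to count how sequences begin and finish.
    """
    transition_counts = defaultdict(Counter)
    observation_counts = defaultdict(Counter)
    for line in lines:
        previous = "<start>"
        for tag, state in parse_sequence(line):
            transition_counts[previous][state] += 1  # state-to-state move
            observation_counts[state][tag] += 1      # tag emitted in state
            previous = state
        transition_counts[previous]["<end>"] += 1
    return normalise(transition_counts), normalise(observation_counts)


transitions, observations = estimate_probabilities(TRAINING_LINES)
print(transitions["gname1"])
print(observations["sname1"])
```

In these five examples the state gname1 is followed by sname1 three times and by gname2 twice, so the estimated transition probabilities are 0.6 and 0.4. Any transition or tag that never occurs in the training data receives no probability at all under this unsmoothed estimate, which is why the representativeness of the training records matters.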

Each line in the example above corresponds to one training record, and contains a sequence that corresponds to a particular path through the various (hidden, unobserved) states of the HMM. Each sequence is made of pairs containing an observation symbol (or tag in our case, the uppercase two-letter word), followed by a colon and an HMM state (the lowercase entity following the colon). These training examples are extracted from the original data set using the tagdata.py program, and the HMMs are created using the trainhmm.py program (see Section 8.2 below).

The following is a basic step-by-step guide for hidden Markov model training within Febrl:

1. First, create a file with a small number of training records using the tagdata.py program. About 100 records should be enough.

2. Open this file with your favourite text editor and modify the tagged training records. Comment out lines that have an incorrect tag sequence (add a hash character '#' at the beginning of the line). For correct tag sequences, add the appropriate HMM state for each tag of the sequence. Be sure to use lowercase for the state names, and to only use state names that are listed in Appendix A. Do not leave any spaces between the tag name (in uppercase), the separating colon and the state name (in lowercase). Do not remove the commas which separate each tag:hmm_state pair. Only one training sequence (one line) should be activated (that is, not commented out) per input data record. We plan to provide a simple graphical user interface to make this editing task faster in a later version of Febrl.

3. Create an initial HMM with trainhmm.py using the modified training file. Set the HMM smoothing option hmm_smoothing to either None, laplace or absdiscount. Borkar et al. [4] suggest that absolute discounting seems to work best, but we have also had good results with Laplace smoothing.

4. Create a second, larger training file (with e.g. 1,000 records), again using the tagdata.py program. This time, set the hmm_file_name option to the name of the initial HMM created in the previous step. In this way the initial HMM which you created using just 100 training records will be used to bootstrap the tagging of this second, larger training file. This reduces the amount of work which needs to be done by the person editing the training file, because it is much easier to correct existing states associated with each tag than it is to add states de novo, as was necessary in step 2 above.

5. Open the second training file and again manually inspect all training records. Comment out incorrect training records, and correct the one closest to the right sequence by changing its HMM state names as appropriate. Again, only one training sequence should be activated (not commented out) per input data record.

6. Create a second HMM using the second training file. Set the smoothing option as desired.

7. Create a third training file by reprocessing your second, corrected training file using the tagdata.py program, setting hmm_file_name to the name of the second HMM file you just created and retag_file_name to the name of the second, corrected training file. Be sure to specify a different output file name for this third training file so that the second training file which you are reprocessing is not overwritten. Set the smoothing option as desired.

8. Examine this third training file. You will see that records in which the sequence of tag:hmm_state pairs has changed will be marked with '***** Changed'. Examine only the changed records (since you have already verified the unchanged records in previous steps) and correct those which appear to be incorrect. Repeat the previous three steps until no more changes are detected in the training file. Note that you may also wish to edit the correction lists and look-up tables to improve the observation symbol tagging of the training data (as opposed to the hidden state assignment). You should re-tag the training file and retrain the HMM after making such modifications to ensure that the changes have not caused other problems, and to ensure that the HMM reflects the latest version of the tagging look-up tables.

9. Finally, in your main project module (as derived from an original project.py module), set the name or address HMM file name to the last HMM file you created.
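
The formatting rules for hand-edited training records given in the guide above (uppercase tag, colon, lowercase state, no spaces around the colon, commas between pairs, '#' for commented-out lines) can be checked mechanically before a file is passed to trainhmm.py. The following Python sketch is one possible checker and is not part of Febrl. The function name, the regular expression (which assumes two-character tags made of uppercase letters or digits) and the small set of allowed states are assumptions for illustration; the real list of valid state names is the one in Appendix A.

```python
import re

# Assumed shape of one pair: two-character uppercase tag, a colon,
# then a lowercase state name, with no spaces in between.
PAIR_PATTERN = re.compile(r"^[A-Z0-9]{2}:[a-z][a-z0-9]*$")

# Hypothetical subset of state names, limited to those used in this
# chapter's examples. Replace with the names listed in Appendix A.
EXAMPLE_STATES = {"gname1", "gname2", "sname1"}


def check_training_line(line, allowed_states):
    """Return a list of problems found in one training-file line.

    Blank lines and lines commented out with '#' are skipped and
    yield an empty list, as does a correctly formatted sequence.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return []
    problems = []
    for item in stripped.split(","):
        pair = item.strip()
        if not PAIR_PATTERN.match(pair):
            problems.append(f"malformed pair: {pair!r}")
            continue
        state = pair.split(":")[1]
        if state not in allowed_states:
            problems.append(f"unknown state: {state!r}")
    return problems


for example in [
    "GF:gname1, SN:sname1",    # well formed
    "# GF:gname1, GM:sname1",  # commented out, so ignored
    "GF: gname1, SN:sname1",   # space after the colon
    "GF:gname1 SN:sname1",     # missing comma
    "GF:gname1, SN:surname",   # state not in the allowed set
]:
    print(example, "->", check_training_line(example, EXAMPLE_STATES))
```

This sketch does not check that exactly one sequence is left active per input record, because that depends on how records are delimited in the training file, which is not described here.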

This training cycle can be enhanced in various ways. For example, once a reasonably good HMM training set has been developed, it can be further improved by adding examples of unusual records to it. By definition, such unusual records occur in the input data only infrequently, and thus very large numbers of training records would need to be examined if they were to be found manually. However, by setting the freqs_file_name option in tagdata.py it is possible to obtain a listing of record formats in ascending order of frequency in a particular input data set (that is, the least common record formats are listed first). Typically, quite large numbers of records are specified for processing by tagdata.py when the freqs_file_name option is set; 100,000 would not be unusual. Of course, there is no prospect of being able to inspect all 100,000 records in the resulting training file, but records with unusual patterns amongst those 100,000 records can be found at or near the top of the output file specified with the freqs_file_name option. Corrected versions of the tag:hmm_state pair sequences for those unusual records can then be added to the smaller, 1,000-record training file as discussed above, and the HMM then retrained using the trainhmm.py program.
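
The idea behind such a frequency listing can be sketched in a few lines of Python. This is not the tagdata.py implementation: it assumes that a "record format" means the sequence of tags assigned to a record, and the function name and sample data are hypothetical.

```python
from collections import Counter


def rare_tag_patterns(tag_sequences, limit=10):
    """Return up to `limit` (pattern, count) pairs, least common first.

    Each element of `tag_sequences` is the list of tags for one record.
    Ties are broken by the pattern itself so the order is reproducible.
    """
    counts = Counter(tuple(sequence) for sequence in tag_sequences)
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return ordered[:limit]


# Hypothetical tag sequences for six name records.
sample = (
    [["GF", "SN"]] * 3
    + [["GF", "GM", "SN"]] * 2
    + [["SN", "GF"]]
)

for pattern, count in rare_tag_patterns(sample):
    print(count, ", ".join(pattern))
```

With this sample, the pattern SN, GF (seen once) is listed first and GF, SN (seen three times) last, so the unusual formats that are worth adding to the training file surface at the top of the listing.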