8.2 Program 'trainhmm.py'

Once tagged training data has been created using the program tagdata.py and edited by a user, a hidden Markov model (HMM) can be created using trainhmm.py.

The program is called from the command line with:

python trainhmm.py

All settings are defined within the module itself, as shown in the code example at the end of this section. The following list describes the configuration settings that must be defined.

After the module header, all necessary modules are imported and a project logger is defined. Settings such as the name of the log file and the console and file logging levels can be modified as described in Chapter 15.

A Febrl object and a new project are defined next. Generally there is no need to modify this.

The name of the input file containing training data must be defined in 'hmm_train_file'. Such a file can be created using the program tagdata.py as described in Section 8.1.

The trained HMM will be written into a text file with the name as defined in 'hmm_model_file'.

The HMM can be given a descriptive name using the 'hmm_name' option. Note that this name is not the file name.

The component set in 'hmm_component' should be the same as the one used to tag the training records in 'hmm_train_file'. Possible values are 'name' or 'address'.

A smoothing method can be chosen using the 'hmm_smoothing' option. Possible values are None for no smoothing, 'laplace' for Laplace smoothing or 'absdiscount' for absolute discount smoothing. Both smoothing methods are described in [4].
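To give a rough idea of what the two smoothing options do, the following sketch applies the textbook forms of Laplace and absolute discount smoothing to a table of observed counts. It is an illustration only: the function names, the discount value of 0.5 and the example counts are assumptions, and the variants actually implemented in Febrl (see [4]) may differ in detail.

```python
def laplace_smooth(counts):
    """Add one to every count, then normalise to probabilities."""
    total = float(sum(counts.values()) + len(counts))
    return {key: (count + 1) / total for key, count in counts.items()}


def absolute_discount_smooth(counts, discount=0.5):
    """Subtract a fixed discount from every non-zero count and share the
    freed probability mass equally among the zero-count entries.

    The discount is assumed to be smaller than the smallest non-zero count.
    """
    total = float(sum(counts.values()))
    seen = [key for key, count in counts.items() if count > 0]
    unseen = [key for key, count in counts.items() if count == 0]
    if not seen or not unseen:
        # Nothing was observed, or there is no unseen entry to give mass
        # to: return plain relative frequencies (all zero if total is 0).
        return {key: (count / total if total else 0.0)
                for key, count in counts.items()}
    freed = discount * len(seen)
    probs = {key: (counts[key] - discount) / total for key in seen}
    for key in unseen:
        probs[key] = freed / len(unseen) / total
    return probs


# Hypothetical transition counts out of one HMM state
example_counts = {'a': 3, 'b': 1, 'c': 0}
laplace_probs = laplace_smooth(example_counts)
absdiscount_probs = absolute_discount_smooth(example_counts)
```

In both cases the entry 'c', which was never observed, ends up with a non-zero probability, so a sequence that needs it is not ruled out entirely.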

The format of the training data input file 'hmm_train_file' must be as follows:

Comment lines start with a hash character ('#'). Blank lines are allowed and are skipped.

Each non-empty line that is not a comment line must contain one training record.

Training records must contain a comma-separated sequence of pairs of the following form (see the example in Section 8.1):

tag:hmm_state

where the tag is one of the possible tags as listed in Appendix B, and hmm_state is one of the possible states from the state lists in Appendix A (either for the name or the address component). Any unknown tag or state in the training data results in an error, and the program stops.
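The format rules above can be sketched as a small stand-alone reader. This is not the code used inside trainhmm.py; the function name parse_training_lines is hypothetical, and the tag and state sets below are short illustrative placeholders for the full lists in Appendix B and Appendix A.

```python
def parse_training_lines(lines, valid_tags, valid_states):
    """Return a list of training records, each a list of (tag, state) pairs.

    Comment lines (starting with '#') and blank lines are skipped. A
    malformed pair or an unknown tag or state raises ValueError.
    """
    records = []
    for line_no, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        record = []
        for pair in line.split(','):
            tag, sep, state = pair.strip().partition(':')
            if not sep:
                raise ValueError('Line %d: expected tag:hmm_state, got %r'
                                 % (line_no, pair))
            if tag not in valid_tags:
                raise ValueError('Line %d: unknown tag %r' % (line_no, tag))
            if state not in valid_states:
                raise ValueError('Line %d: unknown state %r'
                                 % (line_no, state))
            record.append((tag, state))
        records.append(record)
    return records


# Illustrative placeholder sets, not the complete lists from the appendices
example_tags = {'NU', 'WN', 'WT'}
example_states = {'wfnu', 'wfna1', 'wfty'}

example_lines = [
    '# a comment line',
    '',
    'NU:wfnu, WN:wfna1, WT:wfty',
]
example_records = parse_training_lines(example_lines,
                                       example_tags, example_states)
```

Running such a check over an edited training file before starting trainhmm.py can catch mistyped tags or states early.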

The following code example shows the part of the trainhmm.py program that needs to be modified by the user according to her or his needs.

# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name = 'febrl-trainhmm.log',
                     file_level = 'WARN',
                  console_level = 'INFO',
                      clear_log = True,
                parallel_output = 'host')

# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

hmm_febrl = Febrl(description = 'HMM training Febrl instance',
                   febrl_path = '.')

hmm_project = hmm_febrl.new_project(name = 'HMM-Train',
                             description = 'Training module for HMMs',
                               file_name = 'hmm.fbr')

# ====================================================================
# Define settings for HMM training

# Name of the file containing training records  - - - - - - - - - - - 
#
hmm_train_file = 'hmm'+dirsep+'address-train.csv'

# Name of the HMM file to be written  - - - - - - - - - - - - - - - -
#
hmm_model_file = 'test-address.hmm'

# Name of the HMM - - - - - - - - - - - - - - - - - - - - - - - - - -
#
hmm_name = 'Test Address HMM'

# Component: Can either be 'name' or 'address'  - - - - - - - - - - -
#
hmm_component = 'address'

# HMM smoothing method, can be either None, 'laplace' or 'absdiscount'
#
hmm_smoothing = 'absdiscount'
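
Because the component and smoothing settings only accept the few values listed earlier in this section, a quick check before starting a training run can catch misspelled values. The helper below is a minimal sketch and not part of Febrl; the name check_hmm_settings and its return convention are assumptions made for illustration.

```python
VALID_COMPONENTS = ('name', 'address')
VALID_SMOOTHING = (None, 'laplace', 'absdiscount')


def check_hmm_settings(component, smoothing):
    """Return a list of problem descriptions (empty if both values are valid)."""
    problems = []
    if component not in VALID_COMPONENTS:
        problems.append("hmm_component must be 'name' or 'address', got %r"
                        % (component,))
    if smoothing not in VALID_SMOOTHING:
        problems.append("hmm_smoothing must be None, 'laplace' or "
                        "'absdiscount', got %r" % (smoothing,))
    return problems


# The values from the listing above pass the check
settings_problems = check_hmm_settings('address', 'absdiscount')
```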