Once tagged training data has been created using the program tagdata.py and edited by a user, a hidden Markov model (HMM) can be created using trainhmm.py.
The program is called from the command line with:
python trainhmm.py
All settings are within the module as shown in the code example at the end of this section. The following list describes the different configuration settings that must be defined.
After the project logger has been set up, a febrl object and a project are defined. Generally there is no need to modify this.
The name of the file containing the training records is set with 'hmm_train_file'. Such a file can be created using the program tagdata.py as described in Section 8.1.
The name of the HMM file to be written is set with 'hmm_model_file'.
A name for the HMM is set with the 'hmm_name' option. Note that this name is not the file name.
The component set with 'hmm_component' should be the same as the one used to tag the training records in 'hmm_train_file'. Possible values are 'name' or 'address'.
The smoothing method is set with the 'hmm_smoothing' option. Possible values are None for no smoothing, 'laplace' for Laplace smoothing or 'absdiscount' for absolute discount smoothing. Both smoothing methods are described in [4].
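To illustrate what smoothing changes, the following is a minimal sketch of Laplace (add-one) smoothing applied to HMM transition probabilities. It is a generic textbook illustration written for modern Python, not the code used inside trainhmm.py. The function name transition_probabilities and the state names 's1', 's2' and 's3' are placeholders chosen for this example, and absolute discount smoothing is not shown.

```python
from collections import Counter


def transition_probabilities(sequences, states, smoothing=None):
    """Estimate transition probabilities from sequences of HMM states.

    smoothing may be None (plain relative frequencies) or 'laplace'
    (add one to every transition count before normalising).
    This is an illustrative sketch, not the Febrl implementation.
    """
    if smoothing not in (None, 'laplace'):
        raise ValueError('This sketch only supports None or "laplace"')

    # Count observed transitions between consecutive states
    counts = {state: Counter() for state in states}
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1

    add = 1 if smoothing == 'laplace' else 0
    probs = {}
    for prev in states:
        total = sum(counts[prev].values()) + add * len(states)
        probs[prev] = {
            curr: (counts[prev][curr] + add) / total if total else 0.0
            for curr in states
        }
    return probs


if __name__ == '__main__':
    states = ['s1', 's2', 's3']          # placeholder state names
    training = [['s1', 's2', 's3'], ['s1', 's2', 's2']]
    print(transition_probabilities(training, states, smoothing=None))
    print(transition_probabilities(training, states, smoothing='laplace'))
```

Without smoothing, a transition that never occurs in the training data gets probability zero, so any record that needs it cannot be matched by the model. With Laplace smoothing every transition keeps a small non-zero probability.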
The format of the training data input file 'hmm_train_file'
must be as follows:
Comment lines start with a hash character ('#'). Blank lines are allowed and are skipped.
Each training record is a sequence of pairs of the form tag:hmm_state, where tag is one of the possible tags listed in Appendix B and hmm_state is one of the possible states from the state lists in Appendix A (for either the name or the address component). Any unknown tag or state in the training data results in an error and the program stops.
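The format rules above can be sketched as a small reader, written for modern Python. This is a hypothetical illustration and not the parser used in trainhmm.py. The function name parse_training_lines, the placeholder tags 'T1' and 'T2', and the placeholder states 'S1' and 'S2' are invented for the example. The comma used to separate pairs within a record is also an assumption, so it is exposed as the pair_sep parameter.

```python
def parse_training_lines(lines, valid_tags, valid_states, pair_sep=','):
    """Parse training records made of tag:hmm_state pairs.

    Comment lines (starting with '#') and blank lines are skipped.
    An unknown tag or state raises ValueError, mirroring the rule
    that such input is an error. The pair separator is an assumption.
    """
    records = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        record = []
        for pair in line.split(pair_sep):
            tag, colon, state = pair.strip().partition(':')
            if not colon or tag not in valid_tags or state not in valid_states:
                raise ValueError(
                    'line %d: invalid pair %r' % (line_no, pair.strip()))
            record.append((tag, state))
        records.append(record)
    return records


if __name__ == '__main__':
    example = ['# a comment line', '', 'T1:S1, T2:S2', 'T2:S2']
    print(parse_training_lines(example, {'T1', 'T2'}, {'S1', 'S2'}))
```

In a real setting the sets of valid tags and states would come from the lists in Appendix B and Appendix A rather than from hand-written placeholders.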
The following code example shows the part of the trainhmm.py program that needs to be modified by the user according to her or his needs.
# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name = 'febrl-trainhmm.log',
                  file_level = 'WARN',
                  console_level = 'INFO',
                  clear_log = True,
                  parallel_output = 'host')

# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

hmm_febrl = Febrl(description = 'HMM training Febrl instance',
                  febrl_path = '.')

hmm_project = hmm_febrl.new_project(name = 'HMM-Train',
                                    description = 'Training module for HMMs',
                                    file_name = 'hmm.fbr')

# ====================================================================
# Define settings for HMM training

# Name of the file containing training records - - - - - - - - - - -
#
hmm_train_file = 'hmm'+dirsep+'address-train.csv'

# Name of the HMM file to be written - - - - - - - - - - - - - - - -
#
hmm_model_file = 'test-address.hmm'

# Name of the HMM - - - - - - - - - - - - - - - - - - - - - - - - - -
#
hmm_name = 'Test Address HMM'

# Component: Can either be 'name' or 'address' - - - - - - - - - - -
#
hmm_component = 'address'

# HMM smoothing method, can be either None, 'laplace' or 'absdiscount'
#
hmm_smoothing = 'absdiscount'