The hidden Markov model approach for name cleaning and standardisation is explained in more detail in Chapter 7. As with the rules based approach, names are standardised into the six name output fields listed in the first column of Table 6.3.
The initialisation of a HMM based name standardiser is very similar to that of the rules based standardiser. All arguments used for the rules based standardiser can also be used for the HMM based name standardiser. One additional argument needs to be given:
name_hmm
Assuming the necessary modules have been imported, a hidden Markov
model (HMM) can be initialised and loaded as shown in the following
code example, after which a HMM based name standardiser can be
initialised. It is also assumed that the febrl.py module has been
imported, so that the directory separator character 'dirsep' used in
the example is available.
# ====================================================================

name_states = ['titl','baby','knwn','andor','gname1','gname2','ghyph',
               'gopbr','gclbr','agname1','agname2','coma','sname1',
               'sname2','shyph','sopbr','sclbr','asname1','asname2',
               'pref1','pref2','rubb']
name_tags = ['NU','AN','TI','PR','GF','GM','SN','ST','SP','HY','CO',
             'NE','II','BO','VB','UN','RU']

myname_hmm = hmm('Name HMM', name_states, name_tags)
myname_hmm.load_hmm('hmm'+dirsep+'name-absdiscount.hmm')

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

name_hmm_std = NameHMMStandardiser(name = 'Name-Standard-HMM',
                           input_fields = ['gname', 'sname'],
                          output_fields = ['title',
                                           None,
                                           'given_name',
                                           'alt_given_name',
                                           'surname',
                                           'alt_surname'],
                               name_hmm = myname_hmm,
                         name_corr_list = name_correction_list,
                         name_tag_table = name_lookup_table,
                            male_titles = ['mr'],
                          female_titles = ['ms'],
                        field_separator = ' ',
                       check_word_spill = True)
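To give a rough idea of what the loaded HMM contributes, the following self-contained sketch decodes a short tag sequence into a state sequence using the standard Viterbi algorithm. It is only an illustration and does not use the Febrl hmm class. It uses three of the states and three of the tags from the lists above, and it assumes that 'TI', 'GM' and 'SN' stand for title, male given name and surname words. All probabilities, and the function name viterbi, are invented for this example; they are not taken from 'name-absdiscount.hmm'.

```python
# Toy Viterbi decoding over a small subset of the name states and tags.
# All probabilities are invented for illustration; they are NOT the
# values stored in 'name-absdiscount.hmm'.

states = ['titl', 'gname1', 'sname1']

# Probability of starting in each state (hypothetical values)
start_p = {'titl': 0.3, 'gname1': 0.6, 'sname1': 0.1}

# Probability of moving from one state to the next (hypothetical values)
trans_p = {
    'titl':   {'titl': 0.05, 'gname1': 0.80, 'sname1': 0.15},
    'gname1': {'titl': 0.01, 'gname1': 0.09, 'sname1': 0.90},
    'sname1': {'titl': 0.01, 'gname1': 0.19, 'sname1': 0.80},
}

# Probability of observing a tag in a given state (hypothetical values)
emit_p = {
    'titl':   {'TI': 0.90, 'GM': 0.05, 'SN': 0.05},
    'gname1': {'TI': 0.02, 'GM': 0.78, 'SN': 0.20},
    'sname1': {'TI': 0.02, 'GM': 0.18, 'SN': 0.80},
}


def viterbi(tag_seq):
    """Return the most likely state sequence for a non-empty tag list."""
    # best[s] holds (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][tag_seq[0]], [s]) for s in states}
    for tag in tag_seq[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximises the path probability
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda item: item[0])
            new_best[s] = (prob * emit_p[s][tag], path + [s])
        best = new_best
    # Return the path of the most probable final state
    return max(best.values(), key=lambda item: item[0])[1]


state_seq = viterbi(['TI', 'GM', 'SN'])
print(state_seq)
```

With these toy values, a title tag followed by a given name tag and a surname tag is assigned to the states 'titl', 'gname1' and 'sname1'. In the standardiser, the words belonging to each state would then be written into the corresponding output fields.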