6.4 Name Cleaning and Standardisation using a Hidden Markov Model Based Approach

The hidden Markov model approach for name cleaning and standardisation is explained in more detail in Chapter 7. As with the rules based approach, names are standardised into the six name output fields listed in the first column of Table 6.3.

The initialisation of a HMM based name standardiser is very similar to that of the rules based standardiser, and all arguments used for the rules based standardiser can also be used for the HMM based name standardiser. The one additional argument that needs to be given is 'name_hmm', which (as in the example below) refers to the hidden Markov model to be used for standardisation.

Assuming the necessary modules have been imported, a hidden Markov model (HMM) can be initialised and loaded as shown in the following code example, after which a HMM based name standardiser can be initialised. The example also assumes that the febrl.py module has been imported, so that the directory separator character 'dirsep' is available.

# ====================================================================

name_states = ['titl','baby','knwn','andor','gname1','gname2','ghyph',
               'gopbr','gclbr','agname1','agname2','coma','sname1',
               'sname2','shyph','sopbr','sclbr','asname1','asname2',
               'pref1','pref2','rubb']

name_tags = ['NU','AN','TI','PR','GF','GM','SN','ST','SP','HY','CO',
             'NE','II','BO','VB','UN','RU']

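# Create the HMM with the states and tags listed above, then load its
# previously saved definition from a file in the 'hmm' directory
#
myname_hmm = hmm('Name HMM', name_states, name_tags)
myname_hmm.load_hmm('hmm'+dirsep+'name-absdiscount.hmm')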

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

name_hmm_std = NameHMMStandardiser(name = 'Name-Standard-HMM',
                           input_fields = ['gname',
                                           'sname'],
                          output_fields = ['title',
                                           None,
                                           'given_name',
                                           'alt_given_name',
                                           'surname',
                                           'alt_surname'],
                               name_hmm = myname_hmm,
                         name_corr_list = name_correction_list,
                         name_tag_table = name_lookup_table,
                            male_titles = ['mr'],
                          female_titles = ['ms'],
                        field_separator = ' ',
                       check_word_spill = True)
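
To give a feeling for what a HMM based standardiser does with the states and tags listed above, the following self-contained sketch finds the most probable state sequence for a short tag sequence and uses it to assign words to output fields. It is an illustration only and not the Febrl implementation: it uses just three of the states, assumes that 'TI', 'GM', 'GF', 'SN' and 'UN' stand for title, male given name, female given name, surname and unknown word tags, and all probabilities, the function viterbi() and the state_to_field mapping are invented for this example.

```python
# Toy illustration only: the states and tags are a small subset of those in
# the listing above, and every probability below is invented for this sketch.
states = ['titl', 'gname1', 'sname1']

start_p = {'titl': 0.3, 'gname1': 0.6, 'sname1': 0.1}

trans_p = {
    'titl':   {'titl': 0.05, 'gname1': 0.70, 'sname1': 0.25},
    'gname1': {'titl': 0.01, 'gname1': 0.09, 'sname1': 0.90},
    'sname1': {'titl': 0.01, 'gname1': 0.19, 'sname1': 0.80},
}

emit_p = {
    'titl':   {'TI': 0.90, 'GM': 0.03, 'GF': 0.03, 'SN': 0.02, 'UN': 0.02},
    'gname1': {'TI': 0.01, 'GM': 0.40, 'GF': 0.40, 'SN': 0.09, 'UN': 0.10},
    'sname1': {'TI': 0.01, 'GM': 0.05, 'GF': 0.05, 'SN': 0.60, 'UN': 0.29},
}

# Hypothetical mapping from HMM states to name output fields
state_to_field = {'titl': 'title',
                  'gname1': 'given_name',
                  'sname1': 'surname'}


def viterbi(tags):
    """Return the most probable state sequence for a list of tags."""
    # best[s] holds (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][tags[0]], [s]) for s in states}
    for tag in tags[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for a transition into s
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda pair: pair[0])
            new_best[s] = (prob * emit_p[s][tag], path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


words = ['mr', 'peter', 'miller']   # Cleaned input words
tags = ['TI', 'GM', 'SN']           # One tag per word (assumed tagging)

path = viterbi(tags)

# Collect the words into output fields according to their assigned state
fields = {}
for word, state in zip(words, path):
    fields.setdefault(state_to_field[state], []).append(word)
fields = {name: ' '.join(parts) for name, parts in fields.items()}

print(path)
print(fields)
```

In this sketch the tag sequence alone decides which state each word is assigned to; the words are only copied into the output fields afterwards. A trained HMM such as the one loaded in the listing above would supply the probabilities that are made up here.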