6.3 Name Cleaning and Standardisation using a Rules Based Approach

The Febrl system currently implements two methods to clean and standardise names: a rules based approach, described in this section, and a hidden Markov model based approach, described in the following section. Both are implemented in the standardisation.py and name.py modules, and both standardise a name into the six output fields listed in the first column of Table 6.3.

A rules based name standardiser is initialised by passing it the arguments shown in the code example below.

In the example code below, the second output field, i.e. the gender guess, is set to None, which means it will not be stored in the cleaned and standardised data set. Checking for word spilling is activated, and the field separator is set to a single whitespace character.

# ====================================================================

name_rules_std = NameRulesStandardiser(name = 'Name-Standard-Rules',
                               input_fields = ['gname',
                                               'sname'],
                              output_fields = ['title',
                                               None,  # Gender guess, not stored
                                               'given_name',
                                               'alt_given_name',
                                               'surname',
                                               'alt_surname'],
                             name_corr_list = name_correction_list,
                             name_tag_table = name_lookup_table,
                                male_titles = ['mr'],
                              female_titles = ['ms'],
                            field_separator = ' ',
                           check_word_spill = True)
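
The effect of the field_separator and output_fields arguments can be illustrated with a small stand-alone sketch. It is not taken from the Febrl code: the two helper functions join_input_fields and store_output_fields are hypothetical, the example record is made up, and the six component values (including the 'male' gender guess) are written by hand rather than produced by a standardiser. The sketch only shows how the input fields might be joined using the field separator, and how an output field set to None leads to the corresponding component not being stored.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of the Febrl API.

def join_input_fields(record, input_fields, field_separator=' '):
    """Concatenate the non-empty input field values into one string."""
    values = [record.get(field, '') for field in input_fields]
    return field_separator.join(value for value in values if value)


def store_output_fields(components, output_fields):
    """Pair each standardised component with its output field name.

    Components whose output field is None are dropped, i.e. not stored.
    """
    if len(components) != len(output_fields):
        raise ValueError('Expected one component per output field')
    return {field: value
            for field, value in zip(output_fields, components)
            if field is not None}


# Made-up input record using the input field names from the example above
record = {'gname': 'peter', 'sname': 'miller'}
name_string = join_input_fields(record, ['gname', 'sname'], ' ')

# Hand-written stand-in for the six components (title, gender guess,
# given name, alternative given name, surname, alternative surname).
# No actual standardisation is performed here.
components = ['', 'male', 'peter', '', 'miller', '']
output_fields = ['title', None, 'given_name', 'alt_given_name',
                 'surname', 'alt_surname']

cleaned = store_output_fields(components, output_fields)
```

In this sketch the gender guess is the only component missing from cleaned, because its entry in output_fields is None. All other components, including empty ones, are kept under their output field names.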