Two different methods are currently implemented in the Febrl system to clean and standardise names. A rules based approach is described in this section and a hidden Markov model based approach in the following section. Both are implemented in the standardisation.py and name.py modules, and both standardise a name into six name output fields as shown in the first column in Table 6.3.
A rules based name standardiser can be initialised as shown in the code example below. The following arguments need to be given.
The values of the fields listed in input_fields are concatenated (using the field_separator as described below) into one string before the parsing and standardisation is done.
An entry in output_fields can be set to None if no output is to be written (for example if one is not interested in the gender guess values), as long as at least one output field is defined (not set to None).
The male_titles argument is a list of title words (e.g. 'mr'), which will be used to guess the gender.
The female_titles argument is a list of title words (e.g. 'ms'), which will be used to guess the gender.
The name component that names are assumed to start with can be set to 'gname' (assuming names start with given names) or 'sname' (assuming names start with the surname). A default value is used if this argument is not given.
The check_word_spill argument can be set to True or False in order to activate or de-activate the word spilling functionality (see Section 6.1.4 for more details). If set to True, word spilling between input fields is checked using the name tag look-up table (e.g. 'peter pa','ul miller' will be corrected into 'peter paul miller'). Note that word spilling is only useful if the value of the field_separator is not an empty string. A default value is used if this argument is not given.
The following example code shows how a rules based standardiser for names can be initialised. Note that the second output field, i.e. the gender guess, is set to None, which means it will not be stored in the cleaned and standardised data set. Word spilling is activated, and the field separator is set to be a whitespace.
# ====================================================================

name_rules_std = NameRulesStandardiser(name = 'Name-Standard-Rules',
                                       input_fields = ['gname', 'sname'],
                                       output_fields = ['title', None,
                                                        'given_name',
                                                        'alt_given_name',
                                                        'surname',
                                                        'alt_surname'],
                                       name_corr_list = name_correction_list,
                                       name_tag_table = name_lookup_table,
                                       male_titles = ['mr'],
                                       female_titles = ['ms'],
                                       field_separator = ' ',
                                       check_word_spill = True)
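To illustrate how the field separator and word spilling interact, the following is a minimal sketch and not the Febrl implementation. The function join_with_word_spill and the set known_words (a stand-in for the name tag look-up table) are hypothetical names introduced here, and the rule used to detect a spilled word is an assumption: two fragments on either side of a field boundary are merged if their concatenation is a known word and they are not both known words themselves.

```python
def join_with_word_spill(field_values, known_words, field_separator=' '):
    """Concatenate input field values into one string.

    Hypothetical helper for illustration only. 'known_words' stands in
    for the name tag look-up table. If a word appears to have spilled
    over a field boundary, the two fragments are joined without the
    separator.
    """
    result = ''
    for value in field_values:
        value = value.strip()
        if not value:
            continue  # Skip empty input fields
        if not result:
            result = value
            continue

        if field_separator:
            # Fragments on either side of the field boundary
            last_word = result.split()[-1]
            first_word = value.split()[0]
            both_known = (last_word in known_words and
                          first_word in known_words)
            # Assumed rule: merge only if the joined fragments form a
            # known word and are not both known words on their own
            if (last_word + first_word) in known_words and not both_known:
                result += value
                continue

        result += field_separator + value
    return result


# Small stand-in for a name look-up table (assumed values)
name_words = {'peter', 'paul', 'miller'}

spilled = join_with_word_spill(['peter pa', 'ul miller'], name_words)
normal = join_with_word_spill(['peter', 'miller'], name_words)
no_separator = join_with_word_spill(['peter pa', 'ul miller'], set(), '')
```

With a whitespace separator the spilled fragments 'pa' and 'ul' are merged into 'paul'. With an empty separator the fields are joined directly, so no word spilling check is needed, which is why the check is only useful for a non-empty field separator.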