6.3 Name Cleaning and Standardisation using a Rules Based Approach

Two different methods are currently implemented in the Febrl system to clean and standardise names. A rules based approach is described in this section and a hidden Markov model based approach in the following section. Both are implemented in the standardisation.py and name.py modules, and both standardise a name into six name output fields as shown in the first column in Table 6.3.

A rules based name standardiser can be initialised as shown in the code example below. It accepts the following arguments, some of which are optional or have default values.

name
A name for the name standardiser. Most suitable is a short string.

description
A longer description of the name standardiser. This argument is optional; the standardisation process works without a description.

input_fields
The name of a field (a string), or a list of field names, from the input data set that will be standardised. If a list is given, the field values are concatenated into one string (using the field_separator described below) before parsing and standardisation are done.

output_fields
A list of six field names as defined in the output data set. The name component standardiser returns standardised names in these six fields, in the sequence as given in Table 6.3. It is possible to set some of these output fields to None if no output is to be written (for example if one is not interested in the gender guess values), as long as at least one output field is defined (not set to None).

name_corr_list
A reference to the name correction list to be used. This list must have been initialised and loaded previously. See Chapter 14 and Section 14.1 for more details.

name_tag_table
A reference to the name tagging table to be used. This table must have been initialised and loaded previously. See Chapter 14 and Section 14.2 for more details.

male_titles
A list of one or more male title words (like 'mr'), which will be used to guess the gender.

female_titles
A list of one or more female title words (like 'ms'), which will be used to guess the gender.

first_name_comp
A hint for the rules based standardisation system about which name component the input names are most likely to start with. The value can be either 'gname' (names start with given names) or 'sname' (names start with the surname). The default value, if this argument is not given, is 'gname'.

field_separator
When more than one input field is given, the field values are concatenated and a field separator string (which can be an empty string) is inserted between them. The default field separator is a single space ' '.

check_word_spill
A flag that can be set to True or False in order to activate or de-activate the word spilling functionality (see Section 6.1.4 for more details). If set to True, word spilling between input fields is checked using the name tag look-up table (e.g. the two field values 'peter pa' and 'ul miller' will be corrected into 'peter paul miller'). Note that word spilling is only useful if the value of the field_separator is not an empty string. The default value for word spilling is True.
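The interaction between input_fields, field_separator and check_word_spill can be illustrated with a minimal sketch. This is not the Febrl implementation: the function concatenate_fields and the plain set known_words (standing in for the name tag look-up table) are hypothetical names introduced only for illustration, and the spill test shown here (joining the last word of one field with the first word of the next) is an assumption about how such a check could work.

```python
def concatenate_fields(values, known_words, field_separator=' ',
                       check_word_spill=True):
    """Join input field values into one string (illustrative sketch only).

    known_words is a hypothetical stand-in for the name tag look-up table.
    """
    result = ''
    for value in values:
        value = value.strip()
        if not value:
            continue  # Skip empty input fields
        if not result:
            result = value
            continue

        spill = False
        # With an empty separator the fields are joined directly anyway,
        # so the spill check only matters for a non-empty separator
        if check_word_spill and field_separator != '':
            last_word = result.split()[-1]   # Last word of the text so far
            first_word = value.split()[0]    # First word of the next field
            spill = (last_word + first_word) in known_words

        if spill:
            result += value                  # Re-join the split word
        else:
            result += field_separator + value
    return result


known_words = {'peter', 'paul', 'miller'}
joined = concatenate_fields(['peter pa', 'ul miller'], known_words)
not_checked = concatenate_fields(['peter pa', 'ul miller'], known_words,
                                 check_word_spill=False)
```

With the spill check active, the fragments 'pa' and 'ul' are re-joined because 'paul' is a known word; with the check switched off, the two field values are simply joined with the separator.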

The following example code shows how a rules based standardiser for names can be initialised. Note that the second output field, i.e. the gender guess, is set to None, which means it will not be stored in the cleaned and standardised data set. Word spilling is activated, and the field separator is set to a single space. The optional description and first_name_comp arguments are not given, so names are assumed to start with given names (the default 'gname').

# ====================================================================

name_rules_std = NameRulesStandardiser(name = 'Name-Standard-Rules',
                               input_fields = ['gname',
                                               'sname'],
                              output_fields = ['title',
                                               None,
                                               'given_name',
                                               'alt_given_name',
                                               'surname',
                                               'alt_surname'],
                             name_corr_list = name_correction_list,
                             name_tag_table = name_lookup_table,
                                male_titles = ['mr'],
                              female_titles = ['ms'],
                            field_separator = ' ',
                           check_word_spill = True)
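How an output_fields list containing None could be applied to the six standardised name components is sketched below. This is an illustration under stated assumptions, not Febrl code: the function map_output_fields is hypothetical, and the example component values are made up to show the effect of the None entry.

```python
def map_output_fields(components, output_fields):
    """Pair six standardised components with output field names.

    Hypothetical helper: components whose output field is None are dropped.
    """
    if len(output_fields) != 6 or len(components) != 6:
        raise ValueError('Exactly six components and output fields expected')
    if all(field is None for field in output_fields):
        raise ValueError('At least one output field must be defined')

    return {field: value
            for field, value in zip(output_fields, components)
            if field is not None}


# Same output field list as in the initialisation example above
output_fields = ['title', None, 'given_name', 'alt_given_name',
                 'surname', 'alt_surname']

# Made-up component values, in the order title, gender guess, given name,
# alternative given name, surname, alternative surname
components = ['mr', 'male', 'peter', 'paul', 'miller', '']

record = map_output_fields(components, output_fields)
```

Here the gender guess is computed but not written, so the resulting record contains only the five named fields.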