Two different methods are currently implemented in the Febrl system to clean and standardise names. A rules based approach is described in this section and a hidden Markov model based approach in the following section. Both are implemented in the standardisation.py and name.py modules, and both standardise a name into six name output fields as shown in the first column in Table 6.3.
A rules based name standardiser can be initialised as shown in the code example below. The following arguments need to be given.
name
description
input_fields
The values of these input fields are concatenated (using the field_separator as described below) into one string before the parsing and standardisation is done.
output_fields
A list of six output field names. A field can be set to None if no output is to be written for it (for example if one is not interested in the gender guess values), as long as at least one output field is defined (not set to None).
name_corr_list
name_tag_table
male_titles
A list of male title words (e.g. 'mr'), which will be used to guess the gender.
female_titles
A list of female title words (e.g. 'ms'), which will be used to guess the gender.
first_name_comp
Either 'gname' (assuming names start with given names) or 'sname' (assuming names start with the surname). The default value (if this argument is not given) is 'gname'.
field_separator
The string placed between the values of the input fields when they are concatenated, for example ' '.
check_word_spill
Set to True or False in order to activate or de-activate the word spilling functionality (see Section 6.1.4 for more details). If set to True, word spilling between input fields is checked using the name tag look-up table (e.g. 'peter pa','ul miller' will be corrected into 'peter paul miller'). Note that word spilling is only useful if the value of the field_separator is not an empty string. The default value for word spilling is True.
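The word spilling check described above can be illustrated with a minimal sketch. This is not the Febrl implementation: the function name merge_spilled_words, the plain set known_words (standing in for the name tag look-up table) and the exact merge rule are assumptions made for illustration only.

```python
def merge_spilled_words(field_values, known_words):
    """Merge a word that was split across two adjacent input fields.

    field_values: list of strings, one per input field.
    known_words:  set of known name words (hypothetical stand-in for
                  the name tag look-up table).
    """
    fields = [value.split() for value in field_values]

    for left, right in zip(fields, fields[1:]):
        if not left or not right:
            continue  # Nothing to merge if either field is empty

        joined = left[-1] + right[0]

        # Assumed rule: merge only if the joined word is known and at
        # least one of the two fragments is not a known word by itself.
        both_known = left[-1] in known_words and right[0] in known_words
        if joined in known_words and not both_known:
            left[-1] = joined
            del right[0]

    return [' '.join(words) for words in fields]


fields = merge_spilled_words(['peter pa', 'ul miller'],
                             {'peter', 'paul', 'miller'})
name_string = ' '.join(value for value in fields if value)
```

With the example values from the text, the fragments 'pa' and 'ul' are joined because 'paul' is a known word, so the concatenated string becomes 'peter paul miller'.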
The following example code shows how a rules based standardiser for names can be initialised. Note that the second output field, i.e. the gender guess, is set to None, which means it will not be stored in the cleaned and standardised data set. Word spilling is activated, and the field separator is set to be a whitespace.
# ====================================================================

name_rules_std = NameRulesStandardiser(name = 'Name-Standard-Rules',
                                       input_fields = ['gname', 'sname'],
                                       output_fields = ['title',
                                                        None,
                                                        'given_name',
                                                        'alt_given_name',
                                                        'surname',
                                                        'alt_surname'],
                                       name_corr_list = name_correction_list,
                                       name_tag_table = name_lookup_table,
                                       male_titles = ['mr'],
                                       female_titles = ['ms'],
                                       field_separator = ' ',
                                       check_word_spill = True)
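To illustrate how the title lists and the None entry in output_fields interact, the following minimal sketch guesses a gender from a title word and then keeps only the values whose output field is defined. It is not Febrl code: the helper names guess_gender and store_outputs, the returned gender strings and the sample name values are all assumptions made for illustration.

```python
def guess_gender(title, male_titles, female_titles):
    """Guess a gender from a title word (return values are assumed)."""
    if title in male_titles:
        return 'male'
    if title in female_titles:
        return 'female'
    return ''  # No guess possible


def store_outputs(values, output_fields):
    """Pair the standardised values with their output field names,
    skipping every value whose output field is set to None."""
    if len(values) != len(output_fields):
        raise ValueError('Need one value per output field')
    return {field: value
            for field, value in zip(output_fields, values)
            if field is not None}


output_fields = ['title', None, 'given_name', 'alt_given_name',
                 'surname', 'alt_surname']

# Hypothetical standardised values, in the same order as output_fields
title = 'mr'
gender = guess_gender(title, ['mr'], ['ms'])
values = [title, gender, 'peter', 'paul', 'miller', '']

record = store_outputs(values, output_fields)
```

Here the gender guess is still computed from the title, but because the second output field is None it does not appear in the resulting record.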