The program tagdata.py is used to create tagged training records selected from the original data set. Each training record is selected randomly from the input data set, then cleaned and tagged in the same way as in the data cleaning and standardisation process within the standardisation.py module. The resulting tag sequence (or sequences) are written to the training file together with the original record, which is included as a comment line.
The program is called from the command line with:
python tagdata.py
All settings are within the program as shown in the code example at the end of this section. The following list describes the different configuration settings that must be defined.
A Febrl object and a project are defined first. Generally there is no need to modify this.
The original input data set is defined next; it must be initialised with access mode 'read'.
The number of records given by 'num_rec_to_select' are selected randomly in the range between (and including) record numbers 'start_rec_number' and 'end_rec_number'. Note that only records with a non-empty name or address component are selected.
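The random selection described above can be sketched as follows. This is a minimal illustration, not Febrl's actual implementation; the function name and the simple list-of-strings data model are assumptions made for the example.

```python
import random

def select_records(records, start_rec_number, end_rec_number,
                   num_rec_to_select, seed=None):
    # Candidate record numbers in the inclusive range whose
    # component (name or address) is non-empty
    candidates = [i for i in range(start_rec_number, end_rec_number + 1)
                  if records[i].strip() != '']
    rng = random.Random(seed)
    # Never ask for more records than there are candidates
    n = min(num_rec_to_select, len(candidates))
    return sorted(rng.sample(candidates, n))

records = ['dr peter baxter dea', '', 'miss monica mitchell meyer',
           'phd tim william jones harris', '']
picked = select_records(records, 0, 4, 2, seed=1)
```

Note how the empty records at positions 1 and 4 can never be picked, mirroring the rule that only records with a non-empty component are selected.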
The selected and tagged training records are written into the file given by 'output_file_name'. An example of the format of this output file is given further below.
The value of 'tag_component' can be either 'name' or 'address'.
The list 'tag_component_fields' contains the names of the input data set fields that make up the component to be tagged. Note that these field names must correspond to field names as defined in the input data set 'fields'.
If word spilling (words spilling over from one field into the next) should be checked, 'check_word_spilling' should be set to True (and the value of 'field_separator' must not be an empty string ''); otherwise set this option to False.
When the fields of a component are concatenated, the 'field_separator' character or string is inserted between them. Note that for word spilling to be activated the value of this field separator must be different from an empty string ''.
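The role of the field separator can be illustrated with a small sketch. This is not Febrl's code; the helper name is hypothetical and the example only shows why a non-empty separator preserves word boundaries between fields.

```python
def concatenate_fields(field_values, field_separator):
    # Join the non-empty field values with the configured separator.
    # With an empty separator the boundary between fields is lost,
    # which is why word-spill checking requires a non-empty one.
    return field_separator.join(v for v in field_values if v != '')

joined = concatenate_fields(['42', 'main', '', 'street'], ' ')
# With field_separator = '' the same fields would run together
fused = concatenate_fields(['main', 'street'], '')
```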
A Hidden Markov Model (HMM) file can be given with 'hmm_file_name', in which case the training records are tagged as well as standardised using this HMM. This allows a semi-automatic training process, where the user only has to inspect the output training file and change HMM states or tags for cases that are standardised incorrectly. This mechanism reduces the time needed to create enough records to train a HMM. If no HMM standardisation is desired, set the value of 'hmm_file_name' to None.
To re-tag an existing training file, set 'retag_file_name' to the name of an existing training file. This is useful when adjustments have been made to tagging look-up tables or correction lists. Note that the re-tagged file will be written to the output file name. Re-tagging is only possible if a HMM file is defined. If no re-tagging should be done, set the value of 'retag_file_name' to None.
With the 'freqs_file_name' option it is possible to compile the frequencies of all tag:hmm_state pair sequences and write them in ascending order into the file with the defined name. This is useful for finding examples of unusual patterns of names or addresses which might need to be added to the training file(s). This option can only be used if a HMM is defined.
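Compiling such frequencies amounts to counting identical sequences and sorting by count. The sketch below assumes sequences are lists of tag:hmm_state strings; the function name is illustrative and this is not the module's actual code.

```python
from collections import Counter

def sequence_frequencies(tagged_sequences):
    # Count identical tag:hmm_state sequences and return them in
    # ascending order of frequency (rare patterns first), as they
    # would appear in the frequency file
    counts = Counter(tuple(seq) for seq in tagged_sequences)
    return sorted(counts.items(), key=lambda item: item[1])

seqs = [['TI:titl', 'GM:gname1', 'SN:sname1'],
        ['TI:titl', 'GM:gname1', 'SN:sname1'],
        ['TI:titl', 'UN:gname1', 'SN:sname1']]
freqs = sequence_frequencies(seqs)
```

The ascending order puts the unusual, one-off patterns at the top of the file, which is what makes it useful for spotting cases worth adding to the training data.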
If the option 'hmm_file_name' is defined (set to the name of a HMM file), the selected training records are given both tags and (hidden) states, as tag:hmm_state pairs, using the HMM.
The format of the output file is as follows. Each selected original input record (name or address component) is written to the output file as a comment line, starting with a hash character '#' followed by the line number from the input file (starting with zero). After each such comment line, one or more lines with tag sequences follow.
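The output format just described can be sketched with a few lines of Python. This is an illustration only, not the program's actual writing routine; the function name and the tuple layout are assumptions.

```python
import io

def write_training_file(out, selected):
    # 'selected' holds tuples of
    # (input line number, original record text, list of tag sequences)
    for line_num, original, tag_seqs in selected:
        # Comment line: hash, input line number, original record
        out.write('# %d: |%s|\n' % (line_num, original))
        # One line per possible tag sequence
        for seq in tag_seqs:
            out.write(', '.join(seq) + '\n')

buf = io.StringIO()
write_training_file(buf, [(0, 'dr peter baxter dea',
                           [['TI:', 'GM:', 'GM:', 'GF:'],
                            ['TI:', 'GM:', 'SN:', 'GF:']])])
output = buf.getvalue()
```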
The user has to manually inspect the output file and delete (or comment out) all lines with tags that are not correct, and insert a HMM state name for each observation tag in a sequence (or modify the HMM state given).
For example, if we have the three selected input records (name component)
'dr peter baxter dea'
'miss monica mitchell meyer'
'phd tim william jones harris'
they will be processed (depending on the available look-up tables) and written into the output training file as
# 0: |dr peter baxter dea|
TI:, GM:, GM:, GF:
TI:, GM:, SN:, GF:
TI:, GM:, GM:, SN:
TI:, GM:, SN:, SN:
# 1: |miss monica mitchell meyer|
TI:, UN:, GM:, SN:
TI:, UN:, SN:, SN:
# 2: |phd tim william jones harris|
TI:, GM:, GM:, UN:, SN:
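When inspecting or post-processing such a training file, each line is either a hash-prefixed comment or a comma-separated sequence of tag:hmm_state pairs (the state part being empty when no HMM was used). A minimal parsing sketch, with a hypothetical function name, could look like this:

```python
def parse_training_line(line):
    # Comment lines (hash-prefixed original records) and blank
    # lines carry no tag sequence
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    # Each item is a 'tag:' or 'tag:hmm_state' pair
    pairs = []
    for item in line.split(','):
        tag, _, state = item.strip().partition(':')
        pairs.append((tag, state))
    return pairs

seq = parse_training_line('TI:titl, GM:gname1, SN:sname1')
```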
If a HMM file is defined in option 'hmm_file_name', the output will be something like the following (again depending on the available look-up tables):
# 0: |dr peter baxter dea|
# TI:titl, GM:gname1, GM:gname2, GF:sname1
# TI:titl, GM:gname1, SN:sname1, GF:sname2
TI:titl, GM:gname1, GM:gname2, SN:sname1
# TI:titl, GM:gname1, SN:sname1, SN:sname2
# 1: |miss monica mitchell meyer|
TI:titl, UN:gname1, GM:sname1, SN:sname2
# TI:titl, UN:gname1, SN:sname1, SN:sname2
# 2: |phd tim william jones harris|
TI:titl, GM:gname1, GM:gname2, UN:sname1, SN:sname2
The following code example shows the part of the tagdata.py program that needs to be modified by the user according to her or his needs.
# ====================================================================
# Define a project logger

init_febrl_logger(log_file_name  = 'febrl-tagdata.log',
                  file_level     = 'WARN',
                  console_level  = 'INFO',
                  clear_log      = True,
                  parallel_output = 'host')

# ====================================================================
# Set up Febrl and create a new project (or load a saved project)

tag_febrl = Febrl(description = 'Data tagging Febrl instance',
                  febrl_path  = '.')

tag_project = tag_febrl.new_project(name = 'Tag-Data',
                                    description = 'Data tagging module',
                                    file_name = 'tag.fbr')

# ====================================================================
# Define settings for data tagging

# Define your original input data set - - - - - - - - - - - - - - - -
#
input_data = DataSetCSV(name = 'example1in',
                        description = 'Example data set 1',
                        access_mode = 'read',
                        header_lines = 1,
                        file_name = 'dsgen'+dirsep+'dataset1.csv',
                        fields = {'rec_id':0,
                                  'given_name':1,
                                  'surname':2,
                                  'street_num':3,
                                  'address_part_1':4,
                                  'address_part_2':5,
                                  'suburb':6,
                                  'postcode':7,
                                  'state':8,
                                  'date_of_birth':9,
                                  'soc_sec_id':10},
                        fields_default = '',
                        strip_fields = True,
                        missing_values = ['','missing'])

# Define block of records to be used for tagging - - - - - - - - - - -
#
start_rec_number = 0
end_rec_number   = 1000  # input_data.num_records

# Define number of records to be selected randomly - - - - - - - - - -
#
num_rec_to_select = 500

# Define name of output data set - - - - - - - - - - - - - - - - - - -
#
output_file_name = 'tagged-data.csv'
# Component: Can either be 'name' or 'address' - - - - - - - - - - - -
#
tag_component = 'address'

# Define a list with field names from the input data set in the - - -
# component (name or address - address in this example)
#
tag_component_fields = ['street_num', 'address_part_1',
                        'address_part_2', 'suburb', 'postcode',
                        'state']

# Define if word spilling should be checked or not - - - - - - - - - -
#
check_word_spilling = True  # Set to True or False

# Define the field separator - - - - - - - - - - - - - - - - - - - - -
#
field_separator = ' '

# Use HMM for tagging and segmenting - - - - - - - - - - - - - - - - -
# (set to the name of a HMM file or None)
#
hmm_file_name = 'hmm'+dirsep+'address-absdiscount.hmm'

# Retag an existing training file - - - - - - - - - - - - - - - - - -
# - Note that re-tagging is only possible if a HMM file name is given
#   as well
# - If the retag file name is defined, the start and end record
#   numbers as defined above are not used, instead the record numbers
#   in the re-tag file are used.
#
retag_file_name = None  # Set to the name of an existing training file
                        # or to None

# Write out frequencies into a file - - - - - - - - - - - - - - - - -
#
freqs_file_name = 'tagged-data-freqs.txt'  # Set to a file name or None

# Define and load lookup tables - - - - - - - - - - - - - - - - - - -
#
name_lookup_table = TagLookupTable(name = 'Name lookup table',
                                   default = '')
name_lookup_table.load(['data'+dirsep+'givenname_f.tbl',
                        'data'+dirsep+'givenname_m.tbl',
                        'data'+dirsep+'name_prefix.tbl',
                        'data'+dirsep+'name_misc.tbl',
                        'data'+dirsep+'saints.tbl',
                        'data'+dirsep+'surname.tbl',
                        'data'+dirsep+'title.tbl'])

name_correction_list = CorrectionList(name = 'Name correction list')
name_correction_list.load('data'+dirsep+'name_corr.lst')

address_lookup_table = TagLookupTable(name = 'Address lookup table',
                                      default = '')
address_lookup_table.load(['data'+dirsep+'country.tbl',
                           'data'+dirsep+'address_misc.tbl',
                           'data'+dirsep+'address_qual.tbl',
                           'data'+dirsep+'institution_type.tbl',
                           'data'+dirsep+'locality_name_act.tbl',
                           'data'+dirsep+'locality_name_nsw.tbl',
                           'data'+dirsep+'post_address.tbl',
                           'data'+dirsep+'postcode_act.tbl',
                           'data'+dirsep+'postcode_nsw.tbl',
                           'data'+dirsep+'saints.tbl',
                           'data'+dirsep+'territory.tbl',
                           'data'+dirsep+'unit_type.tbl',
                           'data'+dirsep+'wayfare_type.tbl'])

address_correction_list = CorrectionList(name = 'Address corr. list')
address_correction_list.load('data'+dirsep+'address_corr.lst')