After an input component string has been cleaned, the next step is to
split it at space boundaries into a list of words, numbers and
possible separators. The name input 'doctor peter paul miller', for
example, is split into a list containing the four words
['doctor', 'peter', 'paul', 'miller']. All leading and trailing
spaces are removed from the list elements.
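The splitting step can be sketched in a few lines of Python. This is only an illustration, not the actual routine: the function name split_component is hypothetical, and it assumes the cleaning step has already separated words and separators by spaces.

```python
def split_component(cleaned):
    """Split a cleaned component string into a list of elements.

    Hypothetical helper for illustration only.
    """
    # str.split() without an argument splits on runs of whitespace and
    # drops leading/trailing whitespace, so no element is empty or padded.
    return cleaned.split()


words = split_component('doctor peter paul miller')
# words == ['doctor', 'peter', 'paul', 'miller']
```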
Using various look-up tables (and some hard-coded rules), each element of this list is then assigned one or more tags. The list of possible tags can be found in Appendix B. The hard-coded rules include, for example, tagging an element as a hyphen, a comma, a slash, a number or an alphanumeric word, while most of the other tags (titles, given names, surnames, postcodes, locality names, wayfare and unit types, countries, etc.) are assigned to words if they are listed in one of the look-up tables provided. If a word (or a word sequence) is found in a look-up table, it is not only tagged, but also replaced by its corresponding corrected entry in the look-up table.
It is possible that a word is listed in more than one look-up table.
Consequently, it will be assigned more than one tag (see for example
the name word 'peter' below). Words which are not found in any
look-up table and which do not match any of the hard-coded tagging
rules are assigned the 'UN' (unknown) tag. A title word like
'doctor', for example, is assigned a title tag 'TI' and is
replaced with the word 'dr', as are the words 'md' and 'phd'
(using the example look-up table shown in Table 6.2).
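A minimal sketch of single-word tagging is given here. The table contents, the name LOOKUP_TABLES and the function tag_word are assumptions made for illustration. Only the tags 'TI', 'GM', 'SN' and 'UN' and the 'doctor', 'md' and 'phd' to 'dr' corrections come from the text. The hard-coded rules (hyphen, comma, number and so on) are left out.

```python
# Hypothetical look-up tables, one per tag. Each maps an input word to
# its corrected form. The real tables are loaded from files.
LOOKUP_TABLES = {
    'TI': {'doctor': 'dr', 'md': 'dr', 'phd': 'dr', 'dr': 'dr'},
    'GM': {'peter': 'peter', 'paul': 'paul'},
    'SN': {'peter': 'peter', 'miller': 'miller'},
}


def tag_word(word):
    """Return (corrected_word, tag_string) for a single word."""
    tags = []
    corrected = word
    for tag, table in LOOKUP_TABLES.items():
        if word in table:
            tags.append(tag)
            # Simplification: if tables disagree on the corrected form,
            # the last matching table wins.
            corrected = table[word]
    if not tags:
        tags.append('UN')  # not found in any look-up table
    # Multiple tags are joined with '/', as in 'GM/SN'.
    return corrected, '/'.join(tags)


# tag_word('doctor') gives ('dr', 'TI')
# tag_word('peter')  gives ('peter', 'GM/SN')
# tag_word('xyzzy')  gives ('xyzzy', 'UN')
```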
The look-up tables are searched using a greedy matching
algorithm, which searches for the longest tuple of elements which
match an entry in the look-up tables. For example, the tuple of words
('macquarie','fields') will be matched with an entry in a look-up
table for the locality 'macquarie fields', rather than with the
shorter entry 'macquarie' from the same look-up table. As another
example, 'st marys' is tagged as 'LN' (locality name) and replaced
with the string 'st_marys', rather than the 'st' part of 'st marys'
being tagged as 'WT' (wayfare type) and being replaced with 'street',
and 'marys' being tagged as 'UN' (assuming this word is not
found in any look-up tables for address words).
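The greedy longest-match search can be sketched as follows. This is an illustrative simplification, not the actual implementation: it uses a single hypothetical table keyed by word tuples, and the corrected entry 'macquarie_fields' is assumed by analogy with 'st_marys'.

```python
# Hypothetical combined look-up table: a tuple of words maps to
# (corrected entry, tag).
TABLE = {
    ('macquarie', 'fields'): ('macquarie_fields', 'LN'),  # assumed form
    ('macquarie',): ('macquarie', 'LN'),
    ('st', 'marys'): ('st_marys', 'LN'),
    ('st',): ('street', 'WT'),
}
MAX_LEN = max(len(key) for key in TABLE)  # longest entry, in words


def greedy_tag(words):
    """Tag a word list, always preferring the longest matching tuple."""
    elements, tags = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate tuple first, then shorter ones.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            entry = TABLE.get(tuple(words[i:i + n]))
            if entry is not None:
                elements.append(entry[0])
                tags.append(entry[1])
                i += n  # skip all words consumed by the match
                break
        else:
            # No tuple starting at this position matched any entry.
            elements.append(words[i])
            tags.append('UN')
            i += 1
    return elements, tags


# greedy_tag(['st', 'marys']) gives (['st_marys'], ['LN'])
# greedy_tag(['st', 'james']) gives (['street', 'james'], ['WT', 'UN'])
```

Trying the longest tuple first is what stops 'st' from being consumed as a wayfare type when a longer locality entry is available.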
While the input to a tagging routine is a cleaned string, the output is a list of elements and the corresponding list of tags. For the example input name string 'doctor peter paul miller', the output is the two lists
['dr', 'peter', 'paul', 'miller']
['TI', 'GM/SN', 'GM', 'SN']
because 'peter' is listed in both the look-up tables for male given names ('GM' tag) and surnames ('SN' tag).