12.2 Data Set Generator Program `generate.py`

As record linkage and deduplication often deal with data sets that contain partially identifying data (such as names and addresses), it can be difficult to acquire data for testing and evaluating new linkage algorithms and techniques. It is also hard to learn how to use and customise record linkage systems effectively without data sets for which the linkage status of record pairs is known.

There is clearly a lack of publicly available real data sets for deduplication and record linkage that could serve as a standard test bed for developing and comparing algorithms (similar to the standard data sets used in information retrieval or machine learning). However, due to privacy and confidentiality issues it is unlikely that such data will ever become publicly available. De-identified data unfortunately cannot be used, as the real values of names and addresses, for example, are at the core of the linkage process.

An alternative is the use of artificially generated data sets. They
have the advantage that the amount of errors introduced, as well as
the linkage status of record pairs, is known. Controlled experiments
can be performed and replicated easily. A first such database
generator (*DBGen*) was presented by [17] and has been
used by others in a number of studies. However, to our knowledge it is
not publicly available. This generator allows the creation of
databases containing duplicate records. It uses lists of names,
cities, states, and postcodes (all from the USA), and provides a large
number of parameters, including the size of the database to be
generated, the percentage and distribution of duplicates, and the
amount and types of errors introduced.

We have developed a data set generator based on ideas by
Hernandez [17] and improved it in various ways. This generator
can create data sets that contain names (based on frequency tables for
surnames and given names), addresses (based on frequency tables for
localities, postcodes, street numbers and addresses, and state or
territory names), dates (like dates of birth), phone numbers, and
identifier numbers (like social security numbers). Frequency tables
for name and address values are taken from telephone directories.
Additionally, user-supplied look-up tables with real-world spelling
variations (i.e. typographical errors and misspellings) of a large
number of names (currently 656 given names, 302 surnames, and 931
locality names) are included to provide common real-world spelling
variations. The user can control the maximum number of errors
introduced per field and per record. User-provided parameters also
include the number of *original* and *duplicate* records to
be created, the maximum number of duplicates for one original record
(according to several probability distributions), and the
probabilities for introducing different errors to
create the duplicate records (like inserting, deleting, transposing
and substituting characters; swapping a field value with another value
from the same look-up table; inserting or deleting spaces; setting a
field value to missing; or swapping the values of two fields). The
positions at which errors are introduced, as well as the types of
errors introduced, are modelled according to real-world studies on
typographical and related errors [20,29].
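
For illustration, values for frequency-based fields could be drawn from such frequency tables roughly as follows. This sketch is our own, and the two-column `value,count` file format is an assumption for illustration rather than the documented format of the **Febrl** frequency files.

```python
import csv
import random

# Illustrative sketch (assumed file format: one 'value,count' pair
# per line; not necessarily the exact Febrl frequency file format).
def load_freq_table(file_name):
    values, counts = [], []
    with open(file_name, 'r') as f:
        for row in csv.reader(f):
            values.append(row[0])
            counts.append(int(row[1]))
    return values, counts

def sample_value(values, counts):
    # Draw a field value with probability proportional to its count
    return random.choices(values, weights=counts, k=1)[0]
```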

This data set generator and all its associated files (frequency files,
a `README.txt` documentation file, and several example data sets) are
part of the **Febrl** distribution and are available in the
directory `dsgen` within the main **Febrl** directory.
Please read the `README.txt` documentation in `dsgen` for
more information on the content of this directory.

The data set generation process works as follows. In the first step
the desired number of *original* records is generated, and in
the second step *duplicates* of these original records are
created by randomly introducing modifications. Each created record
is given a unique identifier, which allows the evaluation of error
rates (falsely linked non-duplicates and non-linked true duplicates).
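
For example, since the identifier of each duplicate record contains the number of the original record it was derived from, the true match status of any record pair can be recovered directly from the identifiers. The following sketch is our own illustration (the helper functions and the `linked_pairs` structure are hypothetical), not part of the **Febrl** code.

```python
# Sketch (hypothetical helpers, not part of Febrl): derive the true
# match status of a record pair from the generated identifiers of
# the forms 'rec-X-org' and 'rec-X-dup-Y'.
def original_number(rec_id):
    return rec_id.split('-')[1]      # 'rec-7-dup-1' -> '7'

def is_true_match(rec_id_a, rec_id_b):
    return original_number(rec_id_a) == original_number(rec_id_b)

# Count falsely linked pairs in a hypothetical list of linked
# (rec_id_a, rec_id_b) tuples produced by some linkage procedure.
def false_link_count(linked_pairs):
    return sum(1 for a, b in linked_pairs if not is_true_match(a, b))
```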

Duplicates are created by randomly introducing modifications (with user-definable probabilities) of the following forms (a simplified sketch of some of these operations is given after the list):

- If an original field value is available in the corresponding misspellings look-up table, randomly select one of its misspellings.
- Insert a new character at a random position into a field value.
- Delete a character at a random position from a field value.
- Substitute a character in a field value with another character (with the possibility to set probabilities for randomly choosing a neighbouring key in the same keyboard row or column).
- Transpose two characters at a random position in a field value.
- Swap (replace) the value in a field with another value (randomly selected from a frequency look-up table of possible values).
- Insert a space into a field value, thus splitting a word.
- Delete a space (if available) from a field value, thus merging two words.
- Set a field value to missing (with a user definable missing value).
- Insert a new field value if the field was originally empty.
- Swap the values of two fields (e.g. the surname with the given name values) in one record.
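
The following sketch illustrates four of the single-character modifications above in simplified form; it is our own illustration, not the implementation used in `generate.py`.

```python
import random

# Simplified illustrations of four of the corruption operations
# listed above (not the actual generate.py implementation).

def insert_char(value, alphabet='abcdefghijklmnopqrstuvwxyz'):
    pos = random.randint(0, len(value))          # position may be at the end
    return value[:pos] + random.choice(alphabet) + value[pos:]

def delete_char(value):
    if not value:
        return value
    pos = random.randrange(len(value))
    return value[:pos] + value[pos+1:]

def substitute_char(value, alphabet='abcdefghijklmnopqrstuvwxyz'):
    if not value:
        return value
    pos = random.randrange(len(value))
    return value[:pos] + random.choice(alphabet) + value[pos+1:]

def transpose_chars(value):
    if len(value) < 2:
        return value
    pos = random.randrange(len(value) - 1)       # swap characters pos, pos+1
    return value[:pos] + value[pos+1] + value[pos] + value[pos+2:]
```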

The `generate.py` program can be started from the command line
with the following argument list:

`python generate.py output_file num_records num_duplicates max_duplicate_per_record max_modification_per_field max_modification_per_record distribution`

The needed arguments are:

- *output_file* : The name of the output file. Currently this is a comma separated values (CSV) text file.
- *num_records* : The number of *original* records to be generated. These records are given a unique record identifier of the form `'rec-X-org'`, with *X* being a record number starting from 0.
- *num_duplicates* : The number of *duplicate* records to be generated. These records are given a unique record identifier of the form `'rec-X-dup-Y'`, with *X* being the record number of the original record upon which the duplicate is based, and *Y* being a duplicate sequence number (starting from 0); i.e. if more than one duplicate record is created from one original record, these duplicates are numbered from 0 onwards.
- *max_duplicate_per_record* : The maximum number of duplicate records that will be created from one original record. The distribution of how many duplicates are created for an original record is given by the argument *distribution*.
- *max_modification_per_field* : The maximum number of modifications that are introduced into a single record field.
- *max_modification_per_record* : The maximum number of modifications that are introduced into a record. This number must be equal to or larger than the maximum number of modifications per field.
- *distribution* : The probability distribution used to determine how many duplicates are created for one original record. Possible distributions are `uniform`, `poisson`, or `zipf`.
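
To make the effect of the *distribution* argument concrete, the following sketch (our own illustration, not code from `generate.py`) shows one way the number of duplicates for an original record could be drawn under each of the three distribution names; the `mean` parameter of the Poisson variant is a hypothetical choice.

```python
import math
import random

# Illustrative sketch only (not the generate.py implementation):
# draw the number of duplicates (between 1 and max_dup) for one
# original record under each supported distribution name.
def num_duplicates(distribution, max_dup, mean=2.0):
    counts = list(range(1, max_dup + 1))
    if distribution == 'uniform':
        weights = [1.0] * len(counts)        # all counts equally likely
    elif distribution == 'poisson':
        weights = [mean ** k * math.exp(-mean) / math.factorial(k)
                   for k in counts]          # truncated Poisson
    elif distribution == 'zipf':
        weights = [1.0 / k for k in counts]  # strongly favours few duplicates
    else:
        raise ValueError('unknown distribution: %s' % distribution)
    return random.choices(counts, weights=weights, k=1)[0]
```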

Many more parameters (like the distribution of how fields are selected
for the introduction of modifications, all the different probabilities of
how modifications are introduced, and the names of the frequency
look-up table files to be used for names and addresses) have to be
edited within the `generate.py` module itself, as shown in the
following code section (taken directly from
`generate.py`, with only the license header, module header
documentation, and imports removed here).

```python
# ====================================================================
# Set this flag to True for verbose output, otherwise to False
#
VERBOSE_OUTPUT = True

# ====================================================================
# For each field (attribute), a dictionary has to be defined with the
# following keys (probabilities can have values between 0.0 and 1.0,
# or they can be missing - in which case it is assumed they have a
# value of 0.0):
# - name           The field name to be used when a header is written
#                  into the output file.
# - type           The type of the field. Possible are:
#                  'freq'  (for fields that use a frequency table
#                           with field values).
#                  'date'  (for date fields in a certain range).
#                  'phone' (for phone numbers).
#                  'ident' (for numerical identifier fields in a
#                           certain range).
# - char_range     The range of random characters that can be
#                  introduced. Can be one of 'alpha', 'digit', or
#                  'alphanum'.
#
# For fields of type 'freq' the following keys must be given:
# - freq_file      The name of a frequency file.
# - misspell_file  The name of a misspellings file.
#
# For fields of type 'date' the following keys must be given:
# - start_date     A start date, must be a tuple (day,month,year).
# - end_date       An end date, must be a tuple (day,month,year).
#
# For fields of type 'phone' the following keys must be given:
# - area_codes     A list with possible area codes (as strings).
# - num_digits     The number of digits in the phone numbers
#                  (without the area code).
#
# For fields of type 'ident' the following keys must be given:
# - start_id       A start identification number.
# - end_id         An end identification number.
#
# For all fields the following keys must be given:
# - select_prob    Probability of selecting a field for introducing
#                  one or more modifications (set this to 0.0 if no
#                  modifications should ever be introduced into this
#                  field). Note: The sum of these select
#                  probabilities over all defined fields must be 1.0.
# - misspell_prob  Probability to swap an original value with a
#                  randomly chosen misspelling from the corresponding
#                  misspelling dictionary (can only be set to larger
#                  than 0.0 if such a misspellings dictionary is
#                  defined for the given field).
# - ins_prob       Probability to insert a character into a field.
# - del_prob       Probability to delete a character from a field.
# - sub_prob       Probability to substitute a character in a field
#                  with another character.
# - trans_prob     Probability to transpose two characters in a
#                  field.
# - val_swap_prob  Probability to swap the value in a field with
#                  another (randomly selected) value for this field
#                  (taken from this field's look-up table).
# - wrd_swap_prob  Probability to swap two words in a field (given
#                  there are at least two words in a field).
# - spc_ins_prob   Probability to insert a space into a field value
#                  (thus splitting a word).
# - spc_del_prob   Probability to delete a space (if available) in a
#                  field (and thus merging two words).
# - miss_prob      Probability to set a field value to missing
#                  (empty).
# - new_val_prob   Probability to insert a new value given the
#                  original value was empty.
#
# Note: The sum over the probabilities ins_prob, del_prob, sub_prob,
#       trans_prob, val_swap_prob, wrd_swap_prob, spc_ins_prob,
#       spc_del_prob, and miss_prob for each defined field must be
#       1.0; or 0.0 if no modification should be done at all on a
#       given field.
#
# ====================================================================
# Comments about typographical errors and misspellings found in the
# literature:
#
# Damerau 1964:        - 80% are single errors: insert, delete,
#                        substitute or transpose
#                      - Statistic given:
#                        567/964 (59%) substitutions
#                        153/964 (16%) deletions
#                         23/964 ( 2%) transpositions
#                         99/964 (10%) insertions
#                        122/964 (13%) multiple errors
#
# Hall 1980:           - OCR and other automatic devices introduce
#                        similar errors of substitutions, deletions
#                        and insertions, but not transpositions;
#                        frequency and type of errors are
#                        characteristics of the device.
#
# Pollock/Zamora 1984: - OCR output contains almost exclusively
#                        substitution errors, which ordinarily
#                        account for less than 20% of keyboarded
#                        misspellings.
#                      - 90-95% of misspellings in raw keyboarding
#                        typically only contain one error.
#                      - Only 7.8% of the first letters of
#                        misspellings were incorrect, compared to
#                        11.7% of the second and 19.2% of the third.
#                      - It is generally assumed that vowels are
#                        less important than consonants.
#                      - The frequency of a misspelling seems to be
#                        determined more by the frequency of its
#                        parent word than by the difficulty of
#                        spelling it.
#                      - Most errors are mechanical (typos), not the
#                        result of poor spelling.
#                      - The more frequent a letter, the more likely
#                        it is to be miskeyed.
#                      - Deletions are about as frequent as
#                        transpositions, but more frequent than
#                        insertions, which in turn are more frequent
#                        than substitutions.
#
# Pollock/Zamora 1983: - Study of 50,000 nonword errors: 3-4
#                        character misspellings constitute only 9.2%
#                        of total misspellings, but they generate
#                        40% of miscorrections.
#
# Peterson 1986:       In two studies:
#                      - Transpose two letters:  2.6% / 13.1%
#                      - Insert letter:         18.7% / 20.3%
#                      - Delete letter:         31.6% / 34.4%
#                      - Substitute letter:     40.0% / 26.9%
#
# Kukich 1990:         - Over 63% of errors in TDD conversations
#                        occur in words of length 2, 3 or 4.
#
# Kukich 1992:         - 13% of non-word spelling errors in a 40,000
#                        word corpus of typed conversations involved
#                        merging of two words, 2% splitting a word
#                        (often at valid forms, "forgot" ->
#                        "for got").
#                      - Most misspellings seem to be within two
#                        characters in length of the correct
#                        spelling.
#
# ====================================================================
# Other comments:
#
# - Intuitively, one can assume that long and infrequent words are
#   more likely to be misspelt than short and common words.
#
# ====================================================================

givenname_dict = {'name':'given_name',
                  'type':'freq',
                  'char_range':'alpha',
                  'freq_file':'data'+os.sep+'givenname-freq.csv',
                  'select_prob':0.10,
                  'misspell_file':'data'+os.sep+'givenname-misspell.tbl',
                  'misspell_prob':0.30,
                  'ins_prob':0.05,
                  'del_prob':0.15,
                  'sub_prob':0.35,
                  'trans_prob':0.05,
                  'val_swap_prob':0.02,
                  'wrd_swap_prob':0.02,
                  'spc_ins_prob':0.01,
                  'spc_del_prob':0.01,
                  'miss_prob':0.02,
                  'new_val_prob':0.02}

surname_dict = {'name':'surname',
                'type':'freq',
                'char_range':'alpha',
                'freq_file':'data'+os.sep+'surname-freq.csv',
                'select_prob':0.15,
                'misspell_file':'data'+os.sep+'surname-misspell.tbl',
                'misspell_prob':0.30,
                'ins_prob':0.10,
                'del_prob':0.10,
                'sub_prob':0.35,
                'trans_prob':0.04,
                'val_swap_prob':0.02,
                'wrd_swap_prob':0.02,
                'spc_ins_prob':0.01,
                'spc_del_prob':0.02,
                'miss_prob':0.02,
                'new_val_prob':0.02}

streetnumber_dict = {'name':'street_number',
                     'type':'freq',
                     'char_range':'digit',
                     'freq_file':'data'+os.sep+'streetnumber-freq.csv',
                     'select_prob':0.10,
                     'ins_prob':0.10,
                     'del_prob':0.15,
                     'sub_prob':0.60,
                     'trans_prob':0.05,
                     'val_swap_prob':0.05,
                     'wrd_swap_prob':0.01,
                     'spc_ins_prob':0.00,
                     'spc_del_prob':0.00,
                     'miss_prob':0.02,
                     'new_val_prob':0.02}

address1_dict = {'name':'address_1',
                 'type':'freq',
                 'char_range':'alpha',
                 'freq_file':'data'+os.sep+'address1-freq.csv',
                 'select_prob':0.10,
                 'ins_prob':0.10,
                 'del_prob':0.15,
                 'sub_prob':0.55,
                 'trans_prob':0.05,
                 'val_swap_prob':0.02,
                 'wrd_swap_prob':0.03,
                 'spc_ins_prob':0.02,
                 'spc_del_prob':0.03,
                 'miss_prob':0.04,
                 'new_val_prob':0.01}

# Address 2 contains property and institution names - only use rarely
# (set missing probability to a high value).
#
address2_dict = {'name':'address_2',
                 'type':'freq',
                 'char_range':'alpha',
                 'freq_file':'data'+os.sep+'address2-freq.csv',
                 'select_prob':0.10,
                 'ins_prob':0.04,
                 'del_prob':0.04,
                 'sub_prob':0.10,
                 'trans_prob':0.02,
                 'val_swap_prob':0.03,
                 'wrd_swap_prob':0.04,
                 'spc_ins_prob':0.02,
                 'spc_del_prob':0.01,
                 'miss_prob':0.60,
                 'new_val_prob':0.10}

suburb_dict = {'name':'suburb',
               'type':'freq',
               'char_range':'alpha',
               'freq_file':'data'+os.sep+'suburb-freq.csv',
               'select_prob':0.10,
               'misspell_file':'data'+os.sep+'suburb-misspell.tbl',
               'misspell_prob':0.40,
               'ins_prob':0.10,
               'del_prob':0.15,
               'sub_prob':0.22,
               'trans_prob':0.04,
               'val_swap_prob':0.01,
               'wrd_swap_prob':0.02,
               'spc_ins_prob':0.02,
               'spc_del_prob':0.02,
               'miss_prob':0.01,
               'new_val_prob':0.01}

postcode_dict = {'name':'postcode',
                 'type':'freq',
                 'char_range':'digit',
                 'freq_file':'data'+os.sep+'postcode-freq.csv',
                 'select_prob':0.05,
                 'ins_prob':0.00,
                 'del_prob':0.00,
                 'sub_prob':0.35,
                 'trans_prob':0.60,
                 'val_swap_prob':0.03,
                 'wrd_swap_prob':0.00,
                 'spc_ins_prob':0.00,
                 'spc_del_prob':0.00,
                 'miss_prob':0.01,
                 'new_val_prob':0.01}

state_dict = {'name':'state',
              'type':'freq',
              'char_range':'alpha',
              'freq_file':'data'+os.sep+'state-freq.csv',
              'select_prob':0.05,
              'ins_prob':0.10,
              'del_prob':0.10,
              'sub_prob':0.55,
              'trans_prob':0.02,
              'val_swap_prob':0.03,
              'wrd_swap_prob':0.00,
              'spc_ins_prob':0.00,
              'spc_del_prob':0.00,
              'miss_prob':0.10,
              'new_val_prob':0.10}

dob_dict = {'name':'date_of_birth',
            'type':'date',
            'char_range':'digit',
            'start_date':(1,1,1900),
            'end_date':(31,12,1999),
            'select_prob':0.10,
            'ins_prob':0.00,
            'del_prob':0.00,
            'sub_prob':0.50,
            'trans_prob':0.30,
            'val_swap_prob':0.05,
            'wrd_swap_prob':0.00,
            'spc_ins_prob':0.00,
            'spc_del_prob':0.00,
            'miss_prob':0.10,
            'new_val_prob':0.05}

age_dict = {'name':'age',
            'type':'freq',
            'char_range':'digit',
            'freq_file':'data'+os.sep+'age-freq.csv',
            'select_prob':0.05,
            'ins_prob':0.00,
            'del_prob':0.00,
            'sub_prob':0.30,
            'trans_prob':0.20,
            'val_swap_prob':0.20,
            'wrd_swap_prob':0.00,
            'spc_ins_prob':0.00,
            'spc_del_prob':0.00,
            'miss_prob':0.20,
            'new_val_prob':0.10}

phonenum_dict = {'name':'phone_number',
                 'type':'phone',
                 'char_range':'digit',
                 'area_codes':['02','03','04','07','08'],
                 'num_digits':8,
                 'select_prob':0.05,
                 'ins_prob':0.00,
                 'del_prob':0.00,
                 'sub_prob':0.40,
                 'trans_prob':0.30,
                 'val_swap_prob':0.15,
                 'wrd_swap_prob':0.00,
                 'spc_ins_prob':0.00,
                 'spc_del_prob':0.00,
                 'miss_prob':0.05,
                 'new_val_prob':0.10}

ssid_dict = {'name':'soc_sec_id',
             'type':'ident',
             'char_range':'digit',
             'start_id':1000000,
             'end_id':9999999,
             'select_prob':0.05,
             'ins_prob':0.00,
             'del_prob':0.00,
             'sub_prob':0.50,
             'trans_prob':0.40,
             'val_swap_prob':0.10,
             'wrd_swap_prob':0.00,
             'spc_ins_prob':0.00,
             'spc_del_prob':0.00,
             'miss_prob':0.00,
             'new_val_prob':0.00}

# Create a field which can be used for blocking (generate values 0 to
# 9 which are not modified in duplicates).
#
blocking_dict = {'name':'blocking_number',
                 'type':'ident',
                 'char_range':'digit',
                 'start_id':0,
                 'end_id':10,
                 'select_prob':0.00,
                 'ins_prob':0.00,
                 'del_prob':0.00,
                 'sub_prob':0.00,
                 'trans_prob':0.00,
                 'val_swap_prob':0.00,
                 'wrd_swap_prob':0.00,
                 'spc_ins_prob':0.00,
                 'spc_del_prob':0.00,
                 'miss_prob':0.00,
                 'new_val_prob':0.00}

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for swapping values between two
# fields. Use field names as defined in the field dictionaries (keys
# 'name').
#
field_swap_prob = {('address_1', 'address_2'):0.02,
                   ('given_name', 'surname'): 0.05,
                   ('postcode',   'suburb'):  0.01}

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for creating a typographical
# error (a new character) in the same row or the same column. This is
# used in the random selection of a new character in the 'sub_prob'
# (substitution of a character in a field).
#
single_typo_prob = {'same_row':0.40,
                    'same_col':0.30}

# --------------------------------------------------------------------
# Now add all field dictionaries into a list according to how they
# should be saved in the output file.
#
field_list = [givenname_dict, surname_dict, streetnumber_dict,
              address1_dict, address2_dict, suburb_dict,
              postcode_dict, state_dict, dob_dict, age_dict,
              phonenum_dict, ssid_dict, blocking_dict]

# --------------------------------------------------------------------
# Flag for writing a header line (keys 'name' of field dictionaries)
#
save_header = True  # Set to 'False' if no header should be written

# --------------------------------------------------------------------
# String to be inserted for missing values
#
missing_value = ''

# ====================================================================
```
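
As an example of how these settings interact, the `select_prob` values effectively define a weighted random choice over `field_list` whenever a field is picked for modification. The following is a minimal sketch of this idea (our illustration, not the actual `generate.py` logic).

```python
import random

# Conceptual sketch (not the actual generate.py code): choose the
# field to modify according to the 'select_prob' values, which are
# assumed to sum to 1.0 over all fields in field_list.
def choose_field(field_list):
    weights = [f.get('select_prob', 0.0) for f in field_list]
    return random.choices(field_list, weights=weights, k=1)[0]

# Usage: field = choose_field(field_list); print(field['name'])
```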

Let's create an example: a very small data set with ten
original records and ten duplicate records (with a maximum of four
duplicate records created from one original record, and a
maximum of two modifications per field and per record). A
`poisson` distribution should be used for the creation of
duplicates, and the resulting 20 records should be written to the
comma separated text file `mydata.csv`. The following call of
`generate.py` will create such a data set.

`python generate.py mydata.csv 10 10 4 2 2 poisson`

The output file `mydata.csv` will contain a header line with the
field names (as defined with the key `'name'` in the field
dictionaries given above), followed by the 20 records in an
unsorted (random) sequence.

The content of `mydata.csv` is given on the following page. For
easier viewing, the records have been sorted (using the Unix command
`sort`) so that original and duplicate records are listed
together.
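
If the Unix `sort` command is not available, the same grouping can be obtained with a few lines of Python. The following sketch is our own illustration and assumes the record identifier is stored in the first column of `mydata.csv`.

```python
import csv

# Sketch: sort the generated records by their record identifier so
# that each original record and its duplicates are listed together
# (assumes the identifier is the first column).
with open('mydata.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # save_header = True writes a header line
    rows = sorted(reader, key=lambda row: row[0])

for row in [header] + rows:
    print(','.join(row))
```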