12.2 Data Set Generator Program 'generate.py'

Because record linkage and deduplication often deal with data sets that contain partially identifying data (such as names and addresses), it can be difficult to acquire data for testing and evaluating new linkage algorithms and techniques. It is also hard to learn how to use and customise record linkage systems effectively without data sets in which the linkage status of record pairs is known.

There is clearly a lack of publicly available real data sets for deduplication and record linkage that could be used as a standard test bed for developing and comparing algorithms (similar to the standard data sets used in information retrieval or machine learning). However, due to privacy and confidentiality issues it is unlikely that such data will ever become publicly available. De-identified data unfortunately cannot be used, as the real values of names and addresses, for example, are at the core of the linkage process.

An alternative is the use of artificially generated data sets. These have the advantage that both the number of errors introduced and the linkage status of record pairs are known, so controlled experiments can be performed and replicated easily. A first such database generator (DBGen) was presented by [17] and has been used in a number of subsequent studies; however, to our knowledge it is not publicly available. This generator allows the creation of databases containing duplicate records. It uses lists of names, cities, states, and postcodes (all from the USA), and provides a large number of parameters, including the size of the database to be generated, the percentage and distribution of duplicates, and the amount and types of errors introduced.

We have developed a data set generator based on ideas by Hernandez [17] and improved it in various ways. The generator can create data sets that contain names (based on frequency tables for surnames and given names), addresses (based on frequency tables for localities, postcodes, street numbers and addresses, and state or territory names), dates (like dates of birth), phone numbers, and identifier numbers (like social security numbers). Frequency tables for name and address values are taken from telephone directories. Additionally, user supplied look-up tables with real-world spelling variations (i.e. typographical errors and misspellings) of a large number of names (currently 656 given names, 302 surnames, and 931 locality names) are included. The user can control the maximum number of errors introduced per field and per record. User provided parameters also include the number of original and duplicate records to be created, the maximum number of duplicates for one original record (according to several probability distributions), and the probabilities for introducing different errors when creating the duplicate records (such as inserting, deleting, transposing and substituting single characters; swapping a field value with another value from the same look-up table; inserting or deleting spaces; setting a field value to missing; or swapping the values of two fields). The positions at which errors are introduced, as well as the types of errors introduced, are modelled according to real-world studies on typographical and related errors [20,29].

This data set generator and all its associated files (frequency files, a README.txt documentation, and several example data sets) are part of the Febrl distribution and are available in the directory dsgen within the main Febrl directory. Please read the README.txt documentation in dsgen for more information on the content of this directory.

The data set generation process works as follows. In the first step the desired number of original records is generated; in the second step duplicates of these original records are created by randomly introducing modifications. Each created record is given a unique identifier, which allows the evaluation of error rates (falsely linked non-duplicates and non-linked true duplicates).

Duplicates are created by randomly introducing modifications (with user-definable probabilities) of the following forms:

- inserting, deleting, substituting or transposing single characters;
- replacing a value with one of its recorded real-world misspellings;
- swapping a field value with another (randomly selected) value from the same look-up table;
- swapping two words within a field;
- inserting or deleting a space (thus splitting or merging words);
- setting a field value to missing (empty);
- swapping the values of two fields (for example given name and surname).
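
The single-character modifications (insert, delete, transpose, substitute) can be sketched as simple string operations. The function names below are illustrative only and do not appear in generate.py:

```python
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def insert_char(s, alphabet=ALPHABET):
    # Insert a random character at a random position.
    pos = random.randint(0, len(s))
    return s[:pos] + random.choice(alphabet) + s[pos:]

def delete_char(s):
    # Delete one randomly chosen character (if the string is non-empty).
    if not s:
        return s
    pos = random.randrange(len(s))
    return s[:pos] + s[pos+1:]

def transpose_chars(s):
    # Swap two adjacent characters at a random position.
    if len(s) < 2:
        return s
    pos = random.randrange(len(s) - 1)
    return s[:pos] + s[pos+1] + s[pos] + s[pos+2:]

def substitute_char(s, alphabet=ALPHABET):
    # Replace one randomly chosen character with a random character.
    if not s:
        return s
    pos = random.randrange(len(s))
    return s[:pos] + random.choice(alphabet) + s[pos+1:]
```

In generate.py the probabilities of applying each of these operations, and the choice of replacement character, are controlled per field by the parameters described below.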

The generate.py program can be started from the command line with the following argument list:

python generate.py  output_file   num_records   num_duplicates   max_duplicate_per_record
                          max_modification_per_field   max_modification_per_record   distribution

The needed arguments are:

- output_file: the name of the comma separated values (CSV) file into which the generated records are written;
- num_records: the number of original records to be created;
- num_duplicates: the total number of duplicate records to be created;
- max_duplicate_per_record: the maximum number of duplicates created from a single original record;
- max_modification_per_field: the maximum number of modifications introduced into a single field;
- max_modification_per_record: the maximum number of modifications introduced into a single record;
- distribution: the probability distribution used to determine how many duplicates are created per original record (the example below uses poisson).

Many more parameters (such as the distribution with which fields are selected for the introduction of modifications, the individual probabilities of the different modifications, and the names of the frequency look-up table files to be used for names and addresses) have to be edited within the generate.py module itself, as shown in the following code section (taken directly from generate.py; only the license header, module documentation, and imports have been removed here).

# ====================================================================
# Set this flag to True for verbose output, otherwise to False
#
VERBOSE_OUTPUT = True
# ====================================================================
# For each field (attribute), a dictionary has to be defined with the
# following keys (probabilities can have values between 0.0 and 1.0,
# or they can be missing - in which case it is assumed they have a
# value of 0.0):
# - name           The field name to be used when a header is written
#                  into the output file.
# - type           The type of the field. Possible are:
#                  'freq'  (for fields that use a frequency table with
#                           field values).
#                  'date'  (for date fields in a certain range).
#                  'phone' (for phone numbers).
#                  'ident' (for numerical identifier fields in a
#                           certain range).
# - char_range     The range of random characters that can be
#                  introduced. Can be one of 'alpha', 'digit', or
#                  'alphanum'.
#
# For fields of type 'freq' the following keys must be given:
# - freq_file      The name of a frequency file.
# - misspell_file  The name of a misspellings file.
#
# For fields of type 'date' the following keys must be given:
# - start_date     A start date, must be a tuple (day,month,year).
# - end_date       A end date, must be a tuple (day,month,year).
#
# For fields of type 'phone' the following keys must be given:
# - area_codes     A list with possible area codes (as strings).
# - num_digits     The number of digits in the phone numbers (without
#                  the area code).
#
# For fields of type 'ident' the following keys must be given:
# - start_id       A start identification number.
# - end_id         An end identification number.
#
# For all fields the following keys must be given:
# - select_prob    Probability of selecting a field for introducing
#                  one or more modifications (set this to 0.0 if no
#                  modifications should be introduced into this field
#                  ever). Note: The sum of these select probabilities
#                  over all defined fields must be 1.0.
# - misspell_prob  Probability to swap an original value with a
#                  randomly chosen misspelling from the corresponding
#                  misspelling dictionary (can only be set to larger
#                  than 0.0 if such a misspellings dictionary is
#                  defined for the given field).
# - ins_prob       Probability to insert a character into a field.
# - del_prob       Probability to delete a character from a field.
# - sub_prob       Probability to substitute a character in a field
#                  with another character.
# - trans_prob     Probability to transpose two characters in a field.
# - val_swap_prob  Probability to swap the value in a field with
#                  another (randomly selected) value for this field
#                  (taken from this field's look-up table).
# - wrd_swap_prob  Probability to swap two words in a field (given
#                  there are at least two words in a field).
# - spc_ins_prob   Probability to insert a space into a field value
#                  (thus splitting a word).
# - spc_del_prob   Probability to delete a space (if available) in a
#                  field (and thus merging two words).
# - miss_prob      Probability to set a field value to missing (empty).
# - new_val_prob   Probability to insert a new value given the
#                  original value was empty.
#
# Note: The sum over the probabilities misspell_prob, ins_prob,
#       del_prob, sub_prob, trans_prob, val_swap_prob, wrd_swap_prob,
#       spc_ins_prob, spc_del_prob, miss_prob and new_val_prob for
#       each defined field must be 1.0 (as in the example field
#       dictionaries below); or 0.0 if no modification should be
#       done at all on a given field.
#
# ====================================================================
# Comments about typographical errors and misspellings found in the
# literature:
#
# Damerau 1964: - 80% are single errors: insert, delete, substitute or
#                                        transpose
#               - Statistic given: 567/964 (59%) substitutions
#                                  153/964 (16%) deletions
#                                   23/964 ( 2%) transpositions
#                                   99/964 (10%) insertions 
#                                  122/964 (13%) multiple errors
#
# Hall 1980: - OCR and other automatic devices introduce similar
#              errors of substitutions, deletions and insertions, but
#              not transpositions; frequency and type of errors are
#              characteristics of the device.
#
# Pollock/Zamora 1984: - OCR output contains almost exclusively
#                        substitution errors which ordinarily account
#                        for less than 20% of keyboarded misspellings
#                      - 90-95% of misspellings in raw keyboarding
#                        typically only contain one error.
#                      - Only 7.8% of the first letter of misspellings
#                        were incorrect, compared to 11.7% of the
#                        second and 19.2% of the third.
#                      - Generally assumed that vowels are less
#                        important than consonants.
#                      - The frequency of a misspelling seems to be
#                        determined more by the frequency of its
#                        parent word than by the difficulty of
#                        spelling it.
#                      - Most errors are mechanical (typos), not the
#                        result of poor spelling.
#                      - The more frequent a letter, the more likely
#                        it is to be miskeyed.
#                      - Deletions are similarly frequent to
#                        transpositions, but more frequent than
#                        insertions, which are in turn more frequent
#                        than substitutions.
#
# Pollock/Zamora 1983: - Study of 50,000 nonword errors, 3-4 character
#                        misspellings constitute only 9.2% of total
#                        misspellings, but they generate 40% of
#                        miscorrections.
#
# Peterson 1986: In two studies:
#                - Transpose two letters:  2.6% / 13.1%
#                - Insert letter:         18.7% / 20.3%
#                - Delete letter:         31.6% / 34.4%
#                - Substitute letter:     40.0% / 26.9%     
#
# Kukich 1990: - Over 63% of errors in TDD conversations occur in
#                words of length 2, 3 or 4.
#
# Kukich 1992: - 13% of non-word spelling errors in a 40,000 word
#                corpus of typed conversations involved merging two
#                words, and 2% involved splitting a word (often at
#                valid forms, "forgot" -> "for got").
#              - Most misspellings seem to be within two characters in
#                length of the correct spelling.
#
# ====================================================================
# Other comments:
#
# - Intuitively, one can assume that long and infrequent words are
#   more likely to be misspelt than short and common words.
#
# ====================================================================

givenname_dict = {'name':'given_name',
                  'type':'freq',
            'char_range':'alpha',
             'freq_file':'data'+os.sep+'givenname-freq.csv',
           'select_prob':0.10,
         'misspell_file':'data'+os.sep+'givenname-misspell.tbl',
         'misspell_prob':0.30,
              'ins_prob':0.05,
              'del_prob':0.15,
              'sub_prob':0.35,
            'trans_prob':0.05,
         'val_swap_prob':0.02,
         'wrd_swap_prob':0.02,
          'spc_ins_prob':0.01,
          'spc_del_prob':0.01,
             'miss_prob':0.02,
          'new_val_prob':0.02}

surname_dict = {'name':'surname',
                'type':'freq',
          'char_range':'alpha',
           'freq_file':'data'+os.sep+'surname-freq.csv',
         'select_prob':0.15,
       'misspell_file':'data'+os.sep+'surname-misspell.tbl',
       'misspell_prob':0.30,
            'ins_prob':0.10,
            'del_prob':0.10,
            'sub_prob':0.35,
          'trans_prob':0.04,
       'val_swap_prob':0.02,
       'wrd_swap_prob':0.02,
        'spc_ins_prob':0.01,
        'spc_del_prob':0.02,
           'miss_prob':0.02,
        'new_val_prob':0.02}

streetnumber_dict = {'name':'street_number',
                     'type':'freq',
               'char_range':'digit',
                'freq_file':'data'+os.sep+'streetnumber-freq.csv',
              'select_prob':0.10,
                 'ins_prob':0.10,
                 'del_prob':0.15,
                 'sub_prob':0.60,
               'trans_prob':0.05,
            'val_swap_prob':0.05,
            'wrd_swap_prob':0.01,
             'spc_ins_prob':0.00,
             'spc_del_prob':0.00,
                'miss_prob':0.02,
             'new_val_prob':0.02}

address1_dict = {'name':'address_1',
                 'type':'freq',
           'char_range':'alpha',
            'freq_file':'data'+os.sep+'address1-freq.csv',
          'select_prob':0.10,
             'ins_prob':0.10,
             'del_prob':0.15,
             'sub_prob':0.55,
           'trans_prob':0.05,
        'val_swap_prob':0.02,
        'wrd_swap_prob':0.03,
         'spc_ins_prob':0.02,
         'spc_del_prob':0.03,
            'miss_prob':0.04,
         'new_val_prob':0.01}

# Address 2 contains property and institution names - only use rarely
# (set missing probability to a high value).
#
address2_dict = {'name':'address_2',
                 'type':'freq',
           'char_range':'alpha',
            'freq_file':'data'+os.sep+'address2-freq.csv',
          'select_prob':0.10,
             'ins_prob':0.04,
             'del_prob':0.04,
             'sub_prob':0.10,
           'trans_prob':0.02,
        'val_swap_prob':0.03,
        'wrd_swap_prob':0.04,
         'spc_ins_prob':0.02,
         'spc_del_prob':0.01,
            'miss_prob':0.60,
         'new_val_prob':0.10}

suburb_dict = {'name':'suburb',
               'type':'freq',
         'char_range':'alpha',
          'freq_file':'data'+os.sep+'suburb-freq.csv',
        'select_prob':0.10,
      'misspell_file':'data'+os.sep+'suburb-misspell.tbl',
      'misspell_prob':0.40,
           'ins_prob':0.10,
           'del_prob':0.15,
           'sub_prob':0.22,
         'trans_prob':0.04,
      'val_swap_prob':0.01,
      'wrd_swap_prob':0.02,
       'spc_ins_prob':0.02,
       'spc_del_prob':0.02,
          'miss_prob':0.01,
       'new_val_prob':0.01}

postcode_dict = {'name':'postcode',
                 'type':'freq',
           'char_range':'digit',
            'freq_file':'data'+os.sep+'postcode-freq.csv',
          'select_prob':0.05,
             'ins_prob':0.00,
             'del_prob':0.00,
             'sub_prob':0.35,
           'trans_prob':0.60,
        'val_swap_prob':0.03,
        'wrd_swap_prob':0.00,
         'spc_ins_prob':0.00,
         'spc_del_prob':0.00,
            'miss_prob':0.01,
         'new_val_prob':0.01}

state_dict = {'name':'state',
              'type':'freq',
        'char_range':'alpha',
         'freq_file':'data'+os.sep+'state-freq.csv',
       'select_prob':0.05,
          'ins_prob':0.10,
          'del_prob':0.10,
          'sub_prob':0.55,
        'trans_prob':0.02,
     'val_swap_prob':0.03,
     'wrd_swap_prob':0.00,
      'spc_ins_prob':0.00,
      'spc_del_prob':0.00,
         'miss_prob':0.10,
      'new_val_prob':0.10}

dob_dict = {'name':'date_of_birth',
            'type':'date',
      'char_range':'digit',
      'start_date':(1,1,1900),
        'end_date':(31,12,1999),
     'select_prob':0.10,
        'ins_prob':0.00,
        'del_prob':0.00,
        'sub_prob':0.50,
      'trans_prob':0.30,
   'val_swap_prob':0.05,
   'wrd_swap_prob':0.00,
    'spc_ins_prob':0.00,
    'spc_del_prob':0.00,
       'miss_prob':0.10,
    'new_val_prob':0.05}

age_dict = {'name':'age',
            'type':'freq',
      'char_range':'digit',
       'freq_file':'data'+os.sep+'age-freq.csv',
     'select_prob':0.05,
        'ins_prob':0.00,
        'del_prob':0.00,
        'sub_prob':0.30,
      'trans_prob':0.20,
   'val_swap_prob':0.20,
   'wrd_swap_prob':0.00,
    'spc_ins_prob':0.00,
    'spc_del_prob':0.00,
       'miss_prob':0.20,
    'new_val_prob':0.10}

phonenum_dict = {'name':'phone_number',
                 'type':'phone',
           'char_range':'digit',
           'area_codes':['02','03','04','07','08'],
           'num_digits':8,
          'select_prob':0.05,
             'ins_prob':0.00,
             'del_prob':0.00,
             'sub_prob':0.40,
           'trans_prob':0.30,
        'val_swap_prob':0.15,
        'wrd_swap_prob':0.00,
         'spc_ins_prob':0.00,
         'spc_del_prob':0.00,
            'miss_prob':0.05,
         'new_val_prob':0.10}

ssid_dict = {'name':'soc_sec_id',
             'type':'ident',
       'char_range':'digit',
         'start_id':1000000,
           'end_id':9999999,
      'select_prob':0.05,
         'ins_prob':0.00,
         'del_prob':0.00,
         'sub_prob':0.50,
       'trans_prob':0.40,
    'val_swap_prob':0.10,
    'wrd_swap_prob':0.00,
     'spc_ins_prob':0.00,
     'spc_del_prob':0.00,
        'miss_prob':0.00,
     'new_val_prob':0.00}

# Create a field which can be used for blocking (generate values 0 to
# 9 which are not modified in duplicates).
#
blocking_dict = {'name':'blocking_number',
                 'type':'ident',
           'char_range':'digit',
             'start_id':0,
               'end_id':10,
          'select_prob':0.00,
             'ins_prob':0.00,
             'del_prob':0.00,
             'sub_prob':0.00,
           'trans_prob':0.00,
        'val_swap_prob':0.00,
        'wrd_swap_prob':0.00,
         'spc_ins_prob':0.00,
         'spc_del_prob':0.00,
            'miss_prob':0.00,
         'new_val_prob':0.00}

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for swapping values between two
# fields. Use field names as defined in the field directories (keys
# 'name').

field_swap_prob = {('address_1',  'address_2'):0.02,
                   ('given_name', 'surname'):  0.05,
                   ('postcode',   'suburb'):   0.01}

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for creating a typographical
# error (a new character) in the same row or the same column. This is
# used in the random selection of a new character in the 'sub_prob'
# (substitution of a character in a field).

single_typo_prob = {'same_row':0.40,
                    'same_col':0.30}

# --------------------------------------------------------------------
# Now add all field dictionaries into a list according to how they 
# should be saved in the output file.

field_list = [givenname_dict, surname_dict, streetnumber_dict,
              address1_dict, address2_dict, suburb_dict,
              postcode_dict, state_dict, dob_dict, age_dict,
              phonenum_dict, ssid_dict, blocking_dict]

# --------------------------------------------------------------------
# Flag for writing a header line (keys 'name' of field dictionaries)

save_header = True  # Set to 'False' if no header should be written

# --------------------------------------------------------------------
# String to be inserted for missing values

missing_value = ''

# ====================================================================
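In the example field dictionaries above, the modification probabilities within each field (including misspell_prob and new_val_prob, where present) add up to 1.0, and the select probabilities over all fields also add up to 1.0. A small helper, not part of generate.py, can verify both constraints before the generator is run; missing keys are treated as 0.0, as described in the comments:

```python
MOD_PROB_KEYS = ['misspell_prob', 'ins_prob', 'del_prob', 'sub_prob',
                 'trans_prob', 'val_swap_prob', 'wrd_swap_prob',
                 'spc_ins_prob', 'spc_del_prob', 'miss_prob',
                 'new_val_prob']

def check_field_dict(field_dict):
    # Sum the modification probabilities of one field dictionary;
    # missing keys default to 0.0. A field is valid if the sum is
    # 1.0, or 0.0 when the field is never modified.
    total = sum(field_dict.get(key, 0.0) for key in MOD_PROB_KEYS)
    return (abs(total - 1.0) < 1e-9) or (total == 0.0), total

def check_select_probs(field_list):
    # The select probabilities over all defined fields should sum to 1.0.
    total = sum(f.get('select_prob', 0.0) for f in field_list)
    return abs(total - 1.0) < 1e-9, total
```

Each dictionary in field_list can be passed through check_field_dict in a loop, and field_list itself through check_select_probs, before starting a generation run.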

Let us create a very small example data set with ten original records and ten duplicate records, with a maximum of four duplicate records created from one original record, and a maximum of two modifications per field and per record. A Poisson distribution is to be used for the creation of duplicates, and the resulting 20 records are to be written into the comma separated text file mydata.csv. The following call of generate.py will create such a data set:

python generate.py mydata.csv 10 10 4 2 2 poisson

The output file mydata.csv will contain a header line with the field names (as defined with the key 'name' in the field dictionaries as given above), followed by the 20 records in an unsorted (random) sequence.

The content of mydata.csv is given below. For easier viewing, the records have been sorted (using the Unix sort command) so that original and duplicate records are listed together.

rec_id, given_name, surname, street_number, address_1, address_2, suburb, postcode, state, date_of_birth, age, phone_number, soc_sec_id, blocking_number
rec-0-org, , trimmer, 569, ragless circuit, , banks, 2603, vic, , 33, 08 21525495, 4349826, 3
rec-1-dup-0, joel, ryan, 43, burnie street, , gowrie, 2902, vic, 19810315, 19, 04 53396828, 8695140, 8
rec-1-dup-1, joel, ryan, 43, burnie street, the manor garden, gowrie, 2902, vic, 19801315, 19, 04 53396828, 8691545, 8
rec-1-dup-2, ryan, joel, 43, burnie street, , gowrie, 2902, vic, 19810315, 19, 04 53396828, 8691545, 8
rec-1-org, joel, ryan, 43, burnie street, , gowrie, 2902, vic, 19810315, 19, 04 53396828, 8691545, 8
rec-2-org, george, lavis, 240, balanu place, , theodore, 2906, vic, 19300729, 31, 07 70667013, 1869905, 3
rec-3-org, freya, simmonds, 57, boobialla street, baldivis estate, nicholls, 2913, qld, 19980722, 33, 07 93481020, 1285003, 1
rec-4-dup-0, joel, kemmif, 28, hackett garden, kilvintonjvillage, braddobn, 2615, vic, 19880302, 26, 07 20582556, 8873200, 3
rec-4-dup-1, joel, kemmif, 28, hackett garden, kilvinton village, braddon, 2615, vic, 19880302, 22, 07 20582556, 8873200, 3
rec-4-dup-2, , kemmif, 28, hackett garden, kilvinton village, braddon, 2615, vic, 19880302, 26, 07 20582556, 8873200, 3
rec-4-org, joel, kemmif, 28, hackett garden, kilvinton village, braddon, 2615, vic, 19880302, 26, 07 20582556, 8873200, 3
rec-5-org, zane, abrahamson, 13, culgoa circuit, glencot, page, 2903, wa, 19770401, 25, 03 63557232, 9811654, 5
rec-6-org, lauren, germinario, 15, somerset street, , calwell, 2604, qld, 19460320, , 08 58178178, 7227812, 4
rec-7-dup-0, benjamin, rogan, 110, antill sgreet, , wanniassa, 2614, npw, 19760211, 33, 04 30982549, 4221809, 1
rec-7-dup-1, benjamin, rogan, 11, antill street, , wanniassa, 2614, nsw, 19760711, 33, 04 30982549, 4221809, 1
rec-7-dup-2, benjamin, rogan, 110, antill street, , wannissa, 2614, nsw, 19760211, 33, 04 30982549, 4221809, 1
rec-7-dup-3, benjamin, rogan, 110, antill street, sun village, wanniassa, 2614, nsw, 19760211, 33, 04 30984549, 4221809, 1
rec-7-org, benjamin, rogan, 110, antill street, , wanniassa, 2614, nsw, 19760211, 33, 04 30982549, 4221809, 1
rec-8-org, george, rovira, 32, phillis place, , isaacs, 2605, vic, 19990815, 28, 03 08638787, 2684728, 9
rec-9-org, dylan, oglanby, 13, north place, all saints college, braddon, 2903, nsw, 19510904, 25, , 1383574, 7
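
Because every generated record carries its rec_id, the true match status of any record pair can be recovered directly from the identifiers, which is what makes the evaluation of linkage error rates possible. A minimal sketch (the helper name is illustrative and not part of Febrl):

```python
def true_match_status(rec_id_a, rec_id_b):
    # rec_id values have the form 'rec-<n>-org' or 'rec-<n>-dup-<m>';
    # two records are a true match exactly when they share the same
    # original record number <n>.
    return rec_id_a.split('-')[1] == rec_id_b.split('-')[1]
```

For the listing above, true_match_status('rec-4-org', 'rec-4-dup-1') is True, while true_match_status('rec-4-org', 'rec-7-dup-0') is False.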