11.2 Database generator Program 'generate.py'

As record linkage is dealing with data sets that contain partially identified data, like names and addresses, it can be very difficult to test and evaluate newly developed software for record linkage. For the user it can be difficult to learn how to apply, evaluate and customise record linkage software effectively without example data sets where the linkage (or deduplication) status is known.

To overcome this we have developed a database generator based on ideas by Hernandez [12]. This generator can create data sets that contain names (based on frequency look-up tables for surnames and given names), addresses (based on frequency tables for suburbs, postcodes, street number and addresses, and state or territory names), dates (like date of birth), and identifier numbers (like a social security number).

This database generator and all its associated files (frequency files, a README.txt documentation, and several example data sets) are part of the Febrl distribution and are stored in the directory dbgen/ within the main Febrl directory. Please read the README.txt documentation in dbgen/ for more information on the content of this directory.

In a first step a number of original records are generated, and in a second step duplicates of these original records are created using randomly introduced modifications. Each of these records is given a unique identifier, which allows the evaluation of accuracy and error rates (false linked record pairs and un-linked true matches) of deduplication and linkage procedures.

Duplicates are created by randomly introducing modifications (with user definable probabilities) of the following forms:

The generate.py program can be started from the command line with the following argument list

Many more parameters (like all the different probabilities of how modifications are introduced, and the names of the frequency look-up table files to be used for names and addresses) have to be edited within the generate.py module itself as shown in the following code section given below (which is directly taken from generate.py).

# ====================================================================
# Set this flag to True for verbose output, otherwise to False
#
VERBOSE_OUTPUT = True

# ====================================================================
#
# For each field (attribute), a dictionary has to be defined with the
# following keys (probabilities can have values between 0.0 and 1.0):
# - name           The field name to be used when a header is written
#                  into the output file.
# - type           The type of the field. Possible are:
#                  'freq' (for fields that use a frequency table with
#                          field values).
#                  'date' (for date fields in a certain range).
#                  'iden' (for numerical identifier fields in a
#                          certain range).
# - char_range     The range of random characters that can be
#                  introduced. Can be one of 'alpha', 'digit', or
#                  'alphanum'.
# - freq_file      The name of a frequency file (for fields of type
#                  'freq' only).
# - start_date     A start date (for fields of type 'date' only),
#                  must be a tuple (day,month,year).
# - end_date       A end date (for fields of type 'date' only), must
#                  be a tuple (day,month,year).
# - start_id       A start identification number (for fields of type
#                  'iden' only).
# - end_id         A end identification number (for fields of type
#                  'iden' only).
#
# - ins_prob       Probability to insert a character into a field.
# - del_prob       Probability to delete a character from a field.
# - sub_prob       Probability to substitute a character in a field
#                  with another character.
# - trans_prob     Probability to transpose two characters in a field.
# - val_swap_prob  Probability to swap the value in a field with
#                  another (randomly selected) value for this field
#                  (taken from this field's look-up table).
# - spc_ins_prob   Probability to insert a space into a field (thus
#                  splitting a word).
# - spc_del_prob   Probability to delete a space (if available) in a
#                  field (and thus merging two words).
# - miss_prob      Probability to set a field value to missing
#                  (empty).

givenname_dict = {'name':'given_name',
                  'type':'freq',
            'char_range':'alpha',
             'freq_file':'data/givenname.csv',
              'ins_prob':0.03,
              'del_prob':0.04,
              'sub_prob':0.05,
            'trans_prob':0.03,
         'val_swap_prob':0.08,
          'spc_ins_prob':0.01,
          'spc_del_prob':0.00,
             'miss_prob':0.02}

surname_dict = {'name':'surname',
                'type':'freq',
          'char_range':'alpha',
           'freq_file':'data/surname.csv',
              'ins_prob':0.05,
              'del_prob':0.04,
              'sub_prob':0.06,
            'trans_prob':0.05,
         'val_swap_prob':0.05,
          'spc_ins_prob':0.01,
          'spc_del_prob':0.01,
             'miss_prob':0.01}


streetnumber_dict = {'name':'street_num',
                     'type':'freq',
               'char_range':'digit',
                'freq_file':'data/streetnumber.csv',
                 'ins_prob':0.0,
                 'del_prob':0.01,
                 'sub_prob':0.0,
               'trans_prob':0.02,
            'val_swap_prob':0.10,
             'spc_ins_prob':0.0,
             'spc_del_prob':0.0,
                'miss_prob':0.03}

address1_dict = {'name':'address_1',
                 'type':'freq',
           'char_range':'alpha',
            'freq_file':'data/address1.csv',
             'ins_prob':0.05,
             'del_prob':0.05,
             'sub_prob':0.07,
           'trans_prob':0.05,
        'val_swap_prob':0.02,
         'spc_ins_prob':0.05,
         'spc_del_prob':0.05,
            'miss_prob':0.02}

address2_dict = {'name':'address_2',
                 'type':'freq',
           'char_range':'alpha',
            'freq_file':'data/address2.csv',
             'ins_prob':0.04,
             'del_prob':0.04,
             'sub_prob':0.08,
           'trans_prob':0.10,
        'val_swap_prob':0.01,
         'spc_ins_prob':0.10,
         'spc_del_prob':0.05,
            'miss_prob':0.09}

suburb_dict = {'name':'suburb',
               'type':'freq',
           'char_range':'alpha',
          'freq_file':'data/suburb.csv',
           'ins_prob':0.02,
           'del_prob':0.03,
           'sub_prob':0.07,
         'trans_prob':0.05,
      'val_swap_prob':0.05,
       'spc_ins_prob':0.02,
       'spc_del_prob':0.01,
          'miss_prob':0.01}

postcode_dict = {'name':'postcode',
                 'type':'freq',
           'char_range':'digit',
            'freq_file':'data/postcode.csv',
             'ins_prob':0.00,
             'del_prob':0.00,
             'sub_prob':0.05,
           'trans_prob':0.10,
        'val_swap_prob':0.01,
         'spc_ins_prob':0.0,
         'spc_del_prob':0.0,
            'miss_prob':0.0}

state_dict = {'name':'state',
              'type':'freq',
        'char_range':'alpha',
         'freq_file':'data/state.csv',
          'ins_prob':0.0,
          'del_prob':0.0,
          'sub_prob':0.01,
        'trans_prob':0.01,
     'val_swap_prob':0.02,
      'spc_ins_prob':0.0,
      'spc_del_prob':0.0,
         'miss_prob':0.01}

dob_dict = {'name':'date_of_birth',
            'type':'date',
      'char_range':'digit',
      'start_date':(01,01,1900),
        'end_date':(31,12,1999),
        'ins_prob':0.0,
        'del_prob':0.0,
        'sub_prob':0.01,
      'trans_prob':0.01,
   'val_swap_prob':0.04,
    'spc_ins_prob':0.0,
    'spc_del_prob':0.0,
       'miss_prob':0.02}

ssid_dict = {'name':'soc_sec_id',
             'type':'iden',
       'char_range':'digit',
         'start_id':1000000,
           'end_id':9999999,
         'ins_prob':0.0,
         'del_prob':0.0,
         'sub_prob':0.02,
       'trans_prob':0.03,
    'val_swap_prob':0.04,
     'spc_ins_prob':0.0,
     'spc_del_prob':0.0,
        'miss_prob':0.00}


# --------------------------------------------------------------------
# Now add all field dictionaries into a list according to how they
# should be saved in the output file

field_list = [givenname_dict, surname_dict, streetnumber_dict,
              address1_dict, address2_dict, suburb_dict,
              postcode_dict, state_dict, dob_dict, ssid_dict]

# --------------------------------------------------------------------
# Flag for writing a header line (keys 'name' of field dictionaries)

save_header = True  # Set to 'False' if no header should be written

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for swapping values between two
# fields.
# Use field names as defined in the field directories (keys 'name').

field_swap_prob = {('address_1', 'address_2'):0.05,
                   ('given_name', 'surname'):0.07}

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for creating a typographical
# error (a new character) in the same row or the same column.
# This is used in the random selection of a new character in the
# 'sub_prob' (substitution of a character in a field)

single_typo_prob = {'same_row':0.4,
                    'same_col':0.3}

# --------------------------------------------------------------------
# String to be inserted for missing values

missing_value = ''

# ====================================================================

Lets make an example and create a very small data set with 10 original records and 10 duplicate records (with a maximal of 4 duplicate records being created using one original record). A poisson distribution should be used for the creation of duplicates, and the resulting 20 records should written into the text file mydata.txt. The following call of generate.py will create such a data set.

The output file mydata.txt will contain a header line with the field names (as defined with the key 'name' in the field dictionaries as given above), followed by the 20 records in an unsorted (random) sequence.

The content of mydata.txt is given below. For easier viewing the records have been sorted (using the Unix command sort) so that original and duplicate records are listed together.

id, given_name, surname, street_num, address_1, address_2, suburb, postcode, state, date_of_birth, soc_sec_id rec-0-dup-0, sybella, white, 11, wyllyplace, inverpine ret vill, tar ee, 4860, sa, 19690705, 8112687 rec-0-dup-1, sybella, carmody, 11, inverpine ret vill, wylly place, taree, 4860, sa, 19690705, 8112678 rec-0-dup-2, sybella, carmody, 3, wylly place, inverpine ret vill, tared, 4860, sa, 19690705, 8112678 rec-0-dup-3, carmody, sybella, 11, wylly place, inverpine ret vill, taree, 4680, sa, 19690705, 8112678 rec-0-org, sybella, carmody, 11, wylly place, inverpine ret vill, taree, 4860, sa, 19690705, 8112678 rec-1-org, lewis, salehi, 15, stuart street, hartford, mentone, 3135, nsw, 19390215, 4314779 rec-2-dup-0, joel, prak, 20, griffith sstreet, myross, st kilda east, 2756, vic, 19490316, 1872693 rec-2-dup-1, joel, beganovic, 20, griffith sstreet, myross, st kilda east, 2756, vic, 19490316, 1872693 rec-2-org, joel, beganovic, 20, griffiths street, myross, st kilda east, 2756, vic, 19490316, 1872693 rec-3-org, erin, denholm, 5, ellenborough street, elura homestead, bundaberg, 4350, nsw, 19320706, 5471990 rec-4-dup-0, white, shakirah, 6, cherry hvaen, dacombocourt, newcaslte, 2089, vic, 19080704, 6529770 rec-4-org, shakirah, white, 6, dacomb court, cherry haven, newcastle, 2089, vic, 19080704, 6529770 rec-5-dup-0, michael, campnaa, , eldridge street, elm brook, eight mile plains, 2040, qld, 19310122, 9989150 rec-5-dup-1, michael, campana, 21, eldridegstreet, , eight mil e plains, 2040, qld, 19310122, 9989150 rec-5-dup-2, michael, campana, 12, eldridge street, elm brook, wollongong north, 2040, qld, 19310122, 9989250 rec-5-org, michael, campana, 21, eldridge street, elmm brook, eight mile plains, 2040, qld, 19310122, 9989150 rec-6-org, luke, white, 15, witt place, flr 1 the professional centre, new farm, 2304, qld, 19600207, 4416617 rec-7-org, caitlin, goonan, 35, burkitt street, , parramatta, 2024, vic, 19031018, 4987722 rec-8-org, thomas, green, 20, tinderry circuit, electorate office, emerald, 3040, nsw, 19880220, 4303596 rec-9-org, mason, matthews, 19, tadgell place, , parkville, 6010, nsw, , 2073052