Because record linkage deals with data sets that contain partially identifying data, such as names and addresses, it can be very difficult to test and evaluate newly developed record linkage software. Without example data sets where the linkage (or deduplication) status is known, it is also difficult for users to learn how to apply, evaluate and customise record linkage software effectively.
To overcome this, we have developed a database generator based on ideas by Hernandez [12]. This generator can create data sets that contain names (based on frequency look-up tables for surnames and given names), addresses (based on frequency tables for suburbs, postcodes, street numbers and addresses, and state or territory names), dates (like a date of birth), and identifier numbers (like a social security number).
This database generator and all its associated files (frequency files, README.txt documentation, and several example data sets) are part of the Febrl distribution and are stored in the directory dbgen/ within the main Febrl directory. Please read the README.txt documentation in dbgen/ for more information on the content of this directory.
In a first step a number of original records are generated, and in a second step duplicates of these original records are created by randomly introducing modifications. Each record is given a unique identifier, which allows the evaluation of accuracy and error rates (falsely linked record pairs and unlinked true matches) of deduplication and linkage procedures.
Duplicates are created by randomly introducing modifications (with user-definable probabilities) of the following forms:

- inserting a character into a field value,
- deleting a character from a field value,
- substituting a character with another character,
- transposing two characters in a field value,
- swapping a field value with another (randomly selected) value from that field's look-up table,
- inserting a space (thus splitting a word) or deleting a space (thus merging two words),
- setting a field value to missing (empty).
The generate.py program can be started from the command line with the following argument list:

python generate.py output_file num_records num_duplicates max_duplicates distribution

The needed arguments are the name of the output file, the number of original records to be created, the number of duplicate records to be created, the maximum number of duplicates that can be created from one original record, and the probability distribution used for creating duplicates, which can be one of uniform, poisson, or zipf.

Original records are given identifiers of the form 'rec-X-org', with X being a record number starting from 0. Duplicate records are given identifiers of the form 'rec-X-dup-Y', with X being the record number of the original record upon which the duplicate is based, and Y being a duplicate sequence number (starting from 0), i.e. if more than one duplicate record is created based on one original record, these duplicates are numbered from 0 onwards.
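Because these identifiers encode which original record a duplicate stems from, the true match status of any pair of generated records can be recovered directly from them. The following minimal sketch shows one way to do this; the helper names are ours and are not part of Febrl.

```python
def original_number(rec_id):
    # Both 'rec-X-org' and 'rec-X-dup-Y' carry the original record
    # number X as the second dash-separated component.
    return int(rec_id.split('-')[1])

def is_true_match(rec_id_a, rec_id_b):
    # Two generated records truly match if they stem from the same
    # original record.
    return original_number(rec_id_a) == original_number(rec_id_b)
```

With such a helper, every candidate record pair produced by a linkage procedure can be classified as a true match or a false match, which is what makes the generated data sets useful for evaluation.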
Many more parameters (like the different probabilities of how modifications are introduced, and the names of the frequency look-up table files to be used for names and addresses) have to be edited within the generate.py module itself, as shown in the following code section (taken directly from generate.py).
# ====================================================================
# Set this flag to True for verbose output, otherwise to False
#
VERBOSE_OUTPUT = True

# ====================================================================
#
# For each field (attribute), a dictionary has to be defined with the
# following keys (probabilities can have values between 0.0 and 1.0):
#
# - name           The field name to be used when a header is written
#                  into the output file.
# - type           The type of the field. Possible values are:
#                  'freq' (for fields that use a frequency table with
#                         field values).
#                  'date' (for date fields in a certain range).
#                  'iden' (for numerical identifier fields in a
#                         certain range).
# - char_range     The range of random characters that can be
#                  introduced. Can be one of 'alpha', 'digit', or
#                  'alphanum'.
# - freq_file      The name of a frequency file (for fields of type
#                  'freq' only).
# - start_date     A start date (for fields of type 'date' only),
#                  must be a tuple (day,month,year).
# - end_date       An end date (for fields of type 'date' only), must
#                  be a tuple (day,month,year).
# - start_id       A start identification number (for fields of type
#                  'iden' only).
# - end_id         An end identification number (for fields of type
#                  'iden' only).
#
# - ins_prob       Probability to insert a character into a field.
# - del_prob       Probability to delete a character from a field.
# - sub_prob       Probability to substitute a character in a field
#                  with another character.
# - trans_prob     Probability to transpose two characters in a field.
# - val_swap_prob  Probability to swap the value in a field with
#                  another (randomly selected) value for this field
#                  (taken from this field's look-up table).
# - spc_ins_prob   Probability to insert a space into a field (thus
#                  splitting a word).
# - spc_del_prob   Probability to delete a space (if available) in a
#                  field (and thus merging two words).
# - miss_prob      Probability to set a field value to missing
#                  (empty).
givenname_dict = {'name':'given_name', 'type':'freq', 'char_range':'alpha',
                  'freq_file':'data/givenname.csv',
                  'ins_prob':0.03, 'del_prob':0.04, 'sub_prob':0.05,
                  'trans_prob':0.03, 'val_swap_prob':0.08,
                  'spc_ins_prob':0.01, 'spc_del_prob':0.00,
                  'miss_prob':0.02}

surname_dict = {'name':'surname', 'type':'freq', 'char_range':'alpha',
                'freq_file':'data/surname.csv',
                'ins_prob':0.05, 'del_prob':0.04, 'sub_prob':0.06,
                'trans_prob':0.05, 'val_swap_prob':0.05,
                'spc_ins_prob':0.01, 'spc_del_prob':0.01,
                'miss_prob':0.01}

streetnumber_dict = {'name':'street_num', 'type':'freq', 'char_range':'digit',
                     'freq_file':'data/streetnumber.csv',
                     'ins_prob':0.0, 'del_prob':0.01, 'sub_prob':0.0,
                     'trans_prob':0.02, 'val_swap_prob':0.10,
                     'spc_ins_prob':0.0, 'spc_del_prob':0.0,
                     'miss_prob':0.03}

address1_dict = {'name':'address_1', 'type':'freq', 'char_range':'alpha',
                 'freq_file':'data/address1.csv',
                 'ins_prob':0.05, 'del_prob':0.05, 'sub_prob':0.07,
                 'trans_prob':0.05, 'val_swap_prob':0.02,
                 'spc_ins_prob':0.05, 'spc_del_prob':0.05,
                 'miss_prob':0.02}
address2_dict = {'name':'address_2', 'type':'freq', 'char_range':'alpha',
                 'freq_file':'data/address2.csv',
                 'ins_prob':0.04, 'del_prob':0.04, 'sub_prob':0.08,
                 'trans_prob':0.10, 'val_swap_prob':0.01,
                 'spc_ins_prob':0.10, 'spc_del_prob':0.05,
                 'miss_prob':0.09}

suburb_dict = {'name':'suburb', 'type':'freq', 'char_range':'alpha',
               'freq_file':'data/suburb.csv',
               'ins_prob':0.02, 'del_prob':0.03, 'sub_prob':0.07,
               'trans_prob':0.05, 'val_swap_prob':0.05,
               'spc_ins_prob':0.02, 'spc_del_prob':0.01,
               'miss_prob':0.01}

postcode_dict = {'name':'postcode', 'type':'freq', 'char_range':'digit',
                 'freq_file':'data/postcode.csv',
                 'ins_prob':0.00, 'del_prob':0.00, 'sub_prob':0.05,
                 'trans_prob':0.10, 'val_swap_prob':0.01,
                 'spc_ins_prob':0.0, 'spc_del_prob':0.0,
                 'miss_prob':0.0}

state_dict = {'name':'state', 'type':'freq', 'char_range':'alpha',
              'freq_file':'data/state.csv',
              'ins_prob':0.0, 'del_prob':0.0, 'sub_prob':0.01,
              'trans_prob':0.01, 'val_swap_prob':0.02,
              'spc_ins_prob':0.0, 'spc_del_prob':0.0,
              'miss_prob':0.01}
dob_dict = {'name':'date_of_birth', 'type':'date', 'char_range':'digit',
            'start_date':(1,1,1900), 'end_date':(31,12,1999),
            'ins_prob':0.0, 'del_prob':0.0, 'sub_prob':0.01,
            'trans_prob':0.01, 'val_swap_prob':0.04,
            'spc_ins_prob':0.0, 'spc_del_prob':0.0,
            'miss_prob':0.02}

ssid_dict = {'name':'soc_sec_id', 'type':'iden', 'char_range':'digit',
             'start_id':1000000, 'end_id':9999999,
             'ins_prob':0.0, 'del_prob':0.0, 'sub_prob':0.02,
             'trans_prob':0.03, 'val_swap_prob':0.04,
             'spc_ins_prob':0.0, 'spc_del_prob':0.0,
             'miss_prob':0.00}

# --------------------------------------------------------------------
# Now add all field dictionaries into a list according to how they
# should be saved in the output file

field_list = [givenname_dict, surname_dict, streetnumber_dict,
              address1_dict, address2_dict, suburb_dict,
              postcode_dict, state_dict, dob_dict, ssid_dict]

# --------------------------------------------------------------------
# Flag for writing a header line (keys 'name' of field dictionaries)

save_header = True  # Set to 'False' if no header should be written

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for swapping values between two
# fields.
# Use field names as defined in the field dictionaries (keys 'name').

field_swap_prob = {('address_1', 'address_2'):0.05,
                   ('given_name', 'surname'):0.07}

# --------------------------------------------------------------------
# Probabilities (between 0.0 and 1.0) for creating a typographical
# error (a new character) in the same row or the same column.
# This is used in the random selection of a new character in the
# 'sub_prob' (substitution of a character in a field)

single_typo_prob = {'same_row':0.4, 'same_col':0.3}
# --------------------------------------------------------------------
# String to be inserted for missing values

missing_value = ''

# ====================================================================
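To illustrate how the character-level probabilities in these field dictionaries are meant to work, the following sketch applies one random modification (insertion, deletion, substitution, or transposition) to a field value. This is a simplified illustration under our own assumptions, with our own function name; it is not the code that generate.py actually runs.

```python
import random
import string

def corrupt_value(value, ins_prob, del_prob, sub_prob, trans_prob,
                  char_range='alpha'):
    # Hypothetical sketch of applying one character-level modification
    # to a field value, chosen according to the given probabilities.
    chars = string.ascii_lowercase if char_range == 'alpha' else string.digits
    value = list(value)
    r = random.random()
    if r < ins_prob and value:
        # Insert a random character at a random position
        value.insert(random.randrange(len(value) + 1), random.choice(chars))
    elif r < ins_prob + del_prob and value:
        # Delete a random character
        del value[random.randrange(len(value))]
    elif r < ins_prob + del_prob + sub_prob and value:
        # Substitute a random character with another character
        value[random.randrange(len(value))] = random.choice(chars)
    elif r < ins_prob + del_prob + sub_prob + trans_prob and len(value) > 1:
        # Transpose two adjacent characters
        pos = random.randrange(len(value) - 1)
        value[pos], value[pos + 1] = value[pos + 1], value[pos]
    return ''.join(value)
```

Calling such a function once per field, with that field's probabilities, is the basic idea behind turning an original record into a duplicate.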
Let's make an example and create a very small data set with 10 original records and 10 duplicate records (with a maximum of 4 duplicate records created from one original record). A poisson distribution should be used for the creation of duplicates, and the resulting 20 records should be written into the text file mydata.txt. The following call of generate.py will create such a data set.
python generate.py mydata.txt 10 10 4 poisson
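For each original record, the number of duplicates to create is drawn from the selected distribution, capped at the given maximum (4 in this example). A rough sketch of such sampling, using our own helper names and a simple clamping strategy (Febrl's internal selection logic may differ in detail):

```python
import math
import random

def _poisson(lam):
    # Knuth's algorithm for sampling a Poisson variate (small means)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def sample_num_duplicates(max_dups, distribution):
    # Hypothetical helper: how many duplicates to derive from one
    # original record, always between 1 and max_dups.
    if distribution == 'uniform':
        return random.randint(1, max_dups)
    if distribution == 'poisson':
        return min(max(1, _poisson(1.0)), max_dups)
    if distribution == 'zipf':
        # Zipf-like weighting: weight proportional to 1/k
        weights = [1.0 / k for k in range(1, max_dups + 1)]
        return random.choices(range(1, max_dups + 1), weights=weights)[0]
    raise ValueError('distribution must be uniform, poisson or zipf')
```

Under a poisson or zipf distribution most originals receive only one or two duplicates, while a few receive up to the maximum, as can be seen in the example output below.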
The output file mydata.txt will contain a header line with the field names (as defined with the key 'name' in the field dictionaries as given above), followed by the 20 records in an unsorted (random) sequence.
The content of mydata.txt is given below. For easier viewing, the records have been sorted (using the Unix command sort) so that original and duplicate records are listed together.
id, given_name, surname, street_num, address_1, address_2, suburb, postcode, state, date_of_birth, soc_sec_id
rec-0-dup-0, sybella, white, 11, wyllyplace, inverpine ret vill, tar ee, 4860, sa, 19690705, 8112687
rec-0-dup-1, sybella, carmody, 11, inverpine ret vill, wylly place, taree, 4860, sa, 19690705, 8112678
rec-0-dup-2, sybella, carmody, 3, wylly place, inverpine ret vill, tared, 4860, sa, 19690705, 8112678
rec-0-dup-3, carmody, sybella, 11, wylly place, inverpine ret vill, taree, 4680, sa, 19690705, 8112678
rec-0-org, sybella, carmody, 11, wylly place, inverpine ret vill, taree, 4860, sa, 19690705, 8112678
rec-1-org, lewis, salehi, 15, stuart street, hartford, mentone, 3135, nsw, 19390215, 4314779
rec-2-dup-0, joel, prak, 20, griffith sstreet, myross, st kilda east, 2756, vic, 19490316, 1872693
rec-2-dup-1, joel, beganovic, 20, griffith sstreet, myross, st kilda east, 2756, vic, 19490316, 1872693
rec-2-org, joel, beganovic, 20, griffiths street, myross, st kilda east, 2756, vic, 19490316, 1872693
rec-3-org, erin, denholm, 5, ellenborough street, elura homestead, bundaberg, 4350, nsw, 19320706, 5471990
rec-4-dup-0, white, shakirah, 6, cherry hvaen, dacombocourt, newcaslte, 2089, vic, 19080704, 6529770
rec-4-org, shakirah, white, 6, dacomb court, cherry haven, newcastle, 2089, vic, 19080704, 6529770
rec-5-dup-0, michael, campnaa, , eldridge street, elm brook, eight mile plains, 2040, qld, 19310122, 9989150
rec-5-dup-1, michael, campana, 21, eldridegstreet, , eight mil e plains, 2040, qld, 19310122, 9989150
rec-5-dup-2, michael, campana, 12, eldridge street, elm brook, wollongong north, 2040, qld, 19310122, 9989250
rec-5-org, michael, campana, 21, eldridge street, elmm brook, eight mile plains, 2040, qld, 19310122, 9989150
rec-6-org, luke, white, 15, witt place, flr 1 the professional centre, new farm, 2304, qld, 19600207, 4416617
rec-7-org, caitlin, goonan, 35, burkitt street, , parramatta, 2024, vic, 19031018, 4987722
rec-8-org, thomas, green, 20, tinderry circuit, electorate office, emerald, 3040, nsw, 19880220, 4303596
rec-9-org, mason, matthews, 19, tadgell place, , parkville, 6010, nsw, , 2073052
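Reading such a file back and grouping its records by original record number makes it easy to check how many duplicates were derived from each original. A small sketch, assuming the file layout shown above (comma-plus-space separated values with a header line); the function name is ours, not part of Febrl:

```python
import csv
from collections import defaultdict

def group_by_original(path):
    # Map each original record number X to the list of record
    # identifiers ('rec-X-org' and 'rec-X-dup-Y') found in the file.
    groups = defaultdict(list)
    with open(path, newline='') as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header line
        for row in reader:
            rec_id = row[0]
            groups[int(rec_id.split('-')[1])].append(rec_id)
    return dict(groups)
```

For the file shown above, group_by_original('mydata.txt')[0] would contain 'rec-0-org' together with its four duplicates, while group_by_original('mydata.txt')[9] would contain only 'rec-9-org'.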