Data sets can be available in many different formats, often stored either as portable text files or in databases, but sometimes also in efficient non-portable binary files, or in application-specific file formats. Within Febrl, the module dataset.py provides the means to access various data set formats. The module has been written to make it easy to add additional methods to read new data formats in the future.
Generally, one can distinguish between data sets that only allow sequential record access, and data sets that allow direct random access to any record stored (using a unique record identifier). While sequential data sets are mainly used for persistent storage of data sets (like in text files or databases), direct random access data sets are very useful for temporary storage - for example after records have been cleaned and standardised and before they are used in a deduplication or linkage process.
The Febrl system currently supports the following five data set implementations.
PQSQL - has been made available by an external contributor for accessing the open source database PostgreSQL. Its interface is very similar to the
SQL data set implementation. For more details please see the module dataset.py.
Records within a data set in Febrl are internally identified by unique integer numbers (starting with zero). These numbers are not stored in data sets with sequential access, but are used in data sets with direct random access to identify records.
Data sets in Febrl need to be initialised. This merely means that the Febrl program needs to do whatever is necessary to prepare to read or write records to the data sets. A data set can be initialised in one of several access modes, and all further access is restricted to this access mode (for example, a data set initialised for read access cannot be written to). This allows one to have input data sets with read access only, temporary data sets with read and write access, and output data sets with write or append access restrictions.
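The restriction to a single access mode can be illustrated with a minimal sketch. The class TinyDataSet and the exception AccessModeError below are hypothetical and are not part of dataset.py; only the access mode names are taken from this section.

```python
class AccessModeError(Exception):
    """Raised when an operation is not allowed in the current access mode."""


class TinyDataSet:
    """Hypothetical stand-in for a data set (not a Febrl class)."""

    READ_MODES = ('read', 'readwrite')
    WRITE_MODES = ('write', 'append', 'readwrite')

    def __init__(self, access_mode):
        if access_mode not in ('read', 'write', 'append', 'readwrite'):
            raise ValueError('Unknown access mode: %r' % (access_mode,))
        self.access_mode = access_mode  # Fixed for all further access
        self.records = []

    def read_record(self, index):
        if self.access_mode not in self.READ_MODES:
            raise AccessModeError('Data set not initialised for reading')
        return self.records[index]

    def write_record(self, record):
        if self.access_mode not in self.WRITE_MODES:
            raise AccessModeError('Data set not initialised for writing')
        self.records.append(record)
```

In this sketch an input data set created with TinyDataSet('read') raises AccessModeError on any call to write_record(), which mirrors the behaviour described above.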
The following access modes are supported and can be set using the
access_mode argument when a data set is initialised.
Each record in a data set comprises one or more fields. A
Python dictionary with the field names as keys needs to be given when
a data set is initialised, as can be seen in the examples in the
following subsections. The values in this dictionary depend on the
data set type (e.g. for a CSV data set the column numbers need to be
given, while for an SQL data set the database column names are
needed). For both the shelve and memory data sets only the field names
are used, so the values in the fields dictionary can be anything.
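As a rough illustration, the same three fields could be described as follows for the different data set types. All field names, column numbers and database column names here are invented for the example, and it is an assumption that column numbers start at zero.

```python
# Hypothetical fields dictionaries; the keys are always the field names.

# CSV data set: the values are column numbers (assumed zero-based here).
csv_fields = {'given_name': 0, 'surname': 1, 'postcode': 4}

# SQL data set: the values are the database column names.
sql_fields = {'given_name': 'gname', 'surname': 'sname', 'postcode': 'pcode'}

# Shelve or memory data set: only the keys are used, the values are ignored.
memory_fields = {'given_name': '', 'surname': '', 'postcode': ''}
```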
It is not necessary to define all fields in a data set. Only the ones that are used in a data cleaning and standardisation or in a record linkage process need to be listed. For example, a data set can contain names and addresses of patients, plus many medical attributes which are not used in a record linkage project (see the various examples throughout this chapter), in which case only the used name and address fields need to be defined in the fields dictionary.
When reading from a data set, records are returned as Python
dictionaries of key:value pairs, where the keys are the
field names and the values the corresponding values read from the data
set. Note that only non-empty fields are returned, thus such a record
dictionary can be quite small even if a data set has many fields.
Similarly, when writing records into a data set, they also have to be
in a dictionary with the field names in the data set as keys. For
fields that are not stored in a dictionary, a
value will be written into the data set (for sequential access data
When reading from a data set, for each record two hidden fields
are returned, namely
_rec_num_ (the record identifier) and
_dataset_name_ (the name of the data set the
record is read from). Using these hidden fields a record can be
uniquely identified. Note that for direct access data sets the hidden
field _rec_num_ is used as a record identifier when records
are written into such a data set.
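The shape of a returned record can be sketched as follows. The helper row_to_record() is hypothetical and only mimics the behaviour described above: empty fields are dropped and the two hidden fields are added. Field names, values and the data set name are invented, and zero-based column numbers are an assumption.

```python
def row_to_record(fields, row, rec_num, dataset_name):
    """Build a record dictionary from one row of column values.

    `fields` maps field names to column numbers (assumed zero-based).
    """
    record = {}
    for name, column in fields.items():
        value = row[column] if column < len(row) else ''
        if value != '':  # Only non-empty fields are kept
            record[name] = value
    record['_rec_num_'] = rec_num            # Hidden field: record identifier
    record['_dataset_name_'] = dataset_name  # Hidden field: data set name
    return record


fields = {'given_name': 0, 'surname': 1, 'postcode': 3}
row = ['peter', '', 'canberra', '2600']

record = row_to_record(fields, row, 0, 'hospital_a')

# The two hidden fields together identify the record uniquely.
record_key = (record['_dataset_name_'], record['_rec_num_'])
```

Here the empty surname does not appear in the record at all, and the third column is skipped because it is not listed in the fields dictionary.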
All data set implementations have the following attributes, which can be set with the corresponding key word (argument) when a data set is initialised.
'append' (sequential access data sets only) or
'readwrite' (direct random access data sets only) as explained above.
missing_values list, the corresponding matching weight is set to a missing value weight (see Section 9.2 for more details).
Sequential data sets (i.e. COL, CSV and SQL) have the following additional attribute that can be set when a data set is initialised.
False. If set to
True (the default value), all fields are stripped of their whitespace before they are written into a data set or read from a data set. If set to
False, whitespace is not stripped off field values.
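The effect of this flag can be sketched with a small helper. The function prepare_fields() and its strip parameter are hypothetical names, and it is an assumption that stripping means removing leading and trailing whitespace only.

```python
def prepare_fields(record, strip=True):
    """Return a copy of a record, optionally with whitespace stripped.

    Assumption: only leading and trailing whitespace is removed.
    """
    if not strip:
        return dict(record)  # Values are left untouched
    cleaned = {}
    for name, value in record.items():
        # Only string values can be stripped; others are copied as they are.
        cleaned[name] = value.strip() if isinstance(value, str) else value
    return cleaned
```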
Two attributes that should be read only (one of them for sequential access data sets only) are
write access mode. For all other access modes, the number of stored records in the data set is calculated, which might take a while (depending on the data set implementation).
Besides initialisation, all data set implementations have the following methods to read or write records.
read_record() (sequential access data sets)
read_record(record_number) (direct random access data sets)
next_record_num number is returned (and the value of
next_record_num increased by one). If no more records are available,
_rec_num_ is returned. If the record with the given number is not in the data set,
None is returned and a warning message is triggered.
read_records(first_record, number_records) (sequential access data sets)
read_records(record_number_list) (direct random access data sets)
first_record+number_records-1. After a block of records is read successfully, the value of
next_record_num is set to the number of the record following the block. Records are returned in a list.
_rec_num_ is not used), while for direct random access data sets the record is written with the value of
_rec_num_ as identifier (if there was already a record with this number in the data set it is overwritten). For sequential data sets, the value of
num_records is increased by one in any case, while for direct random access data sets it is only increased if previously no record was stored in the data set with the given record number.
num_records is increased by the number of records written, while for direct random access data sets the hidden fields
_rec_num_ are used to write records into the data set. The value of
num_records with direct random access data sets is only increased by the number of records with new record numbers.
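The read and write behaviour described for direct random access data sets can be sketched with a dictionary keyed by record number. The class MemoryStore is a hypothetical stand-in, not the memory data set implementation in dataset.py; only the method names, the hidden field _rec_num_ and the attribute num_records follow the text above. How read_records() treats missing record numbers is an assumption.

```python
import warnings


class MemoryStore:
    """Hypothetical in-memory direct random access store (not a Febrl class)."""

    def __init__(self):
        self.records = {}     # Maps record number to record dictionary
        self.num_records = 0

    def write_record(self, record):
        rec_num = record['_rec_num_']  # Hidden field used as identifier
        if rec_num not in self.records:
            self.num_records += 1      # Only new record numbers are counted
        self.records[rec_num] = record  # An existing record is overwritten

    def write_records(self, record_list):
        for record in record_list:
            self.write_record(record)

    def read_record(self, record_number):
        if record_number not in self.records:
            warnings.warn('Record %d is not in the data set' % record_number)
            return None
        return self.records[record_number]

    def read_records(self, record_number_list):
        # Assumption: missing record numbers yield None entries in the list.
        return [self.read_record(number) for number in record_number_list]
```

Writing a second record with an existing record number replaces the stored record and leaves num_records unchanged in this sketch.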
Direct random access data sets have another method that allows them to be re-initialised.
re_initialise(access_mode) (direct random access data sets)
access_mode argument is given, the data set will be initialised in the same access mode. Alternatively a new access mode can be given.
read access mode so that all processes are able to read all the records in such a data set.
'host' (the default) or
'all'. In the case where all processes are writing, each of them will write into its own (local) data sets. This is done by adding the process number (for example
P0 for the host,
P2, etc. for the other processes) to the file names. For the SQL data set only the
'host' write access mode is currently supported.
In the following sections details on the data set implementations and their specific attributes are given.