13. Data Set Access

Data sets come in many different formats. They are often stored as portable text files or in databases, but sometimes also in efficient non-portable binary files or in application-specific file formats. Within Febrl, the module dataset.py provides the means to access various data set formats. The module has been written to make it easy to add methods for reading new data formats in the future.

Generally, one can distinguish between data sets that only allow sequential record access, and data sets that allow direct random access to any stored record (using a unique record identifier). While sequential data sets are mainly used for persistent storage of data (for example in text files or databases), direct random access data sets are very useful for temporary storage, for example after records have been cleaned and standardised and before they are used in a deduplication or linkage process.

The Febrl system currently supports the following five data set implementations.

COL
Text files with fields with fixed column width (sequential access).
CSV
Comma separated values text files (sequential access).
SQL
SQL database access using the open source relational database system MySQL^13.1 (sequential access).
Shelve
Efficient file-based hash-tables (direct random access) provided by Python.
Memory
Efficient in-memory hash-tables (direct random access) provided by Python.

Note: Another data set implementation, named PQSQL, has been made available by an external contributor for accessing the open source database PostgreSQL. Its interface is very similar to the SQL data set implementation. For more details please see the module dataset.py.

Records within a data set in Febrl are internally identified by unique integer numbers (starting with zero). These numbers are not stored in data sets with sequential access, but are used in data sets with direct random access to identify records.

Data sets in Febrl need to be initialised. This merely means that the Febrl program does whatever is necessary to prepare for reading records from or writing records to the data set. A data set can be initialised in one of several access modes, and all further access is restricted to this access mode (for example, a data set initialised for read access cannot be written to). This allows one to have input data sets with read access only, temporary data sets with read and write access, and output data sets with write or append access restrictions.

The following access modes are supported and can be set using the argument access_mode when a data set is initialised.

read
For reading records from a data set. If an empty data set is initialised in read mode an error is triggered.
write
For writing to a data set. If the data set already exists, its content will be erased first. It is not possible to read records from a data set initialised in write access mode.
append
For appending records to a data set (i.e. if it already contains records they will not be erased). It is not possible to read records from a data set initialised in append access mode. Note that append mode can only be used with sequential access data sets.
readwrite
For both reading from and writing to a data set. This access mode is only possible with direct random access data sets.
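
The restrictions on access modes listed above can be summarised in a short sketch. The names VALID_MODES, check_access_mode, can_read and can_write, as well as the labels 'sequential' and 'random', are made up for this illustration and are not taken from dataset.py.

```python
# Illustrative only: these names are assumptions, not part of dataset.py.
VALID_MODES = {
    'sequential': {'read', 'write', 'append'},    # COL, CSV, SQL
    'random': {'read', 'write', 'readwrite'},     # Shelve, Memory
}


def check_access_mode(access_type, access_mode):
    """Return True if the access mode is allowed for this kind of data set."""
    if access_type not in VALID_MODES:
        raise ValueError('Unknown access type: %r' % (access_type,))
    return access_mode in VALID_MODES[access_type]


def can_read(access_mode):
    """Reading is only possible in 'read' and 'readwrite' mode."""
    return access_mode in ('read', 'readwrite')


def can_write(access_mode):
    """Writing is possible in 'write', 'append' and 'readwrite' mode."""
    return access_mode in ('write', 'append', 'readwrite')
```

For example, check_access_mode('random', 'append') is False in this sketch, because append mode is restricted to sequential access data sets.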

Each record in a data set comprises one or more fields. A Python dictionary with the field names as keys needs to be given when a data set is initialised, as can be seen in the examples in the following subsections. The values in this dictionary depend on the data set type (e.g. for a CSV data set the column numbers need to be given, while for an SQL data set the database column names are needed). For both the shelve and memory data sets only the field names are used, so the values in the fields dictionary can be anything.

It is not necessary to define all fields in a data set. Only the ones that are used in a data cleaning and standardisation or in a record linkage process need to be listed. For example, a data set can contain names and addresses of patients, plus many medical attributes which are not used in a record linkage project (see the various examples throughout this chapter), in which case only the used name and address fields need to be defined in the fields dictionary.

When reading from a data set, records are returned as Python dictionaries, i.e. key:value pairs, where the keys are the field names and the values the corresponding values read from the data set. Note that only non-empty fields are returned, thus such a record dictionary can be quite small even if a data set has many fields.

Similarly, when writing records into a data set, each record has to be a dictionary with the field names of the data set as keys. For fields that are not present in a record dictionary, the fields_default value will be written into the data set (for sequential access data sets only).
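
As a rough sketch of this behaviour, the following helper turns a record dictionary into an output row for a CSV-like data set. The function record_to_row is hypothetical, and the assumption that column numbers start at zero is made only for this example.

```python
def record_to_row(record, fields, fields_default=''):
    """Build an output row (list of strings) from a record dictionary.

    'fields' maps field names to column numbers (assumed to start at 0).
    Fields that are missing from the record get the 'fields_default' value.
    """
    num_columns = max(fields.values()) + 1
    row = [fields_default] * num_columns
    for field_name, column in fields.items():
        row[column] = record.get(field_name, fields_default)
    return row


fields = {'givenname': 0, 'surname': 1, 'postcode': 2}
record = {'givenname': 'peter', 'postcode': '2600'}
row = record_to_row(record, fields, fields_default='missing')
print(row)  # ['peter', 'missing', '2600']
```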

When reading from a data set, for each record two hidden fields are returned, namely _rec_num_ (the record identifier number) and _dataset_name_ (the name of the data set the record is read from). Using these hidden fields a record can be uniquely identified. Note that for direct access data sets the hidden field _rec_num_ is used as a record identifier when records are written into such a data set.
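
The shape of a record returned by a read operation can be illustrated as follows. The helper row_to_record is hypothetical and only mimics the behaviour described above for a CSV-like row; column numbers are assumed to start at zero.

```python
def row_to_record(row, fields, rec_num, dataset_name):
    """Convert one input row (list of strings) into a record dictionary.

    Only non-empty fields are kept, and the two hidden fields are added.
    """
    record = {}
    for field_name, column in fields.items():
        value = row[column] if column < len(row) else ''
        if value != '':
            record[field_name] = value
    record['_rec_num_'] = rec_num
    record['_dataset_name_'] = dataset_name
    return record


fields = {'givenname': 0, 'surname': 1, 'postcode': 2}
row = ['peter', '', '2600', 'unused medical attribute']
record = row_to_record(row, fields, 0, 'hospital')
```

Here the empty surname is left out of the record, and the fourth column is ignored because it is not defined in the fields dictionary.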

All data set implementations have the following attributes, which can be set with the corresponding key word (argument) when a data set is initialised.

name
The name of the data set. This should be a short string only.
description
A longer description of the data set. Note that this argument is not mandatory.
access_mode
A data set can be initialised in one of the four access modes 'read', 'write', 'append' (sequential access data sets only) or 'readwrite' (direct random access data sets only) as explained above.
fields
A dictionary with the fields (columns, attributes) in the data set. The field names must be the keys in this dictionary.
fields_default
Default string if a field is not found (for writing records into a sequential data set). The default value is an empty string.
missing_values
A list of strings that correspond to missing values in a data set. This list is used in the calculation of matching weights in the record linkage process, i.e. if a value in a record is equal to a missing value as defined in this missing_values list, the corresponding matching weight is set to a missing value weight (see Section 9.2 for more details).

Sequential data sets (i.e. COL, CSV and SQL) have the following additional attribute that can be set when a data set is initialised.

strip_fields
A flag that can be set to True or False. If set to True (the default value), leading and trailing whitespace is stripped from all field values before they are written into a data set or after they are read from a data set. If set to False, whitespace is not stripped from field values.

Two attributes that should be treated as read only (one of them is available for sequential access data sets only) are:

num_records
The total number of records in a data set. Note that for sequential access data sets initialised in write access mode the value of this attribute will be 0. For all other access modes, the number of stored records in the data set is calculated, which might take a while (depending on the data set implementation).
next_record_num
This attribute is only available in sequential access data sets. It is the record number of the next record to be accessed. When a data set is initialised, the value of this attribute will be 0, and after having read one record it will be 1. If a block of records is read, it will be the number of the next record in the data set.
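
The interplay of these two attributes with sequential reading can be sketched with a toy class. SequentialSketch is not part of Febrl; it only models a sequential data set opened in read mode, and it assumes (in line with record numbers starting at zero) that next_record_num starts at 0.

```python
class SequentialSketch(object):
    """Toy stand-in for a sequential data set opened in read mode."""

    def __init__(self, records):
        self._records = list(records)
        self.num_records = len(self._records)
        self.next_record_num = 0  # Assumed start value, see lead-in

    def read_record(self):
        """Return the next record, or None if no more records are available."""
        if self.next_record_num >= self.num_records:
            return None
        record = self._records[self.next_record_num]
        self.next_record_num += 1
        return record

    def read_records(self, first_record, number_records):
        """Return a block of records and move the counter past the block."""
        last = min(first_record + number_records, self.num_records)
        block = self._records[first_record:last]
        self.next_record_num = last
        return block


ds = SequentialSketch([{'surname': 'miller'}, {'surname': 'smith'},
                       {'surname': 'jones'}])
first = ds.read_record()       # next_record_num is now 1
block = ds.read_records(1, 2)  # next_record_num is now 3
```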

Besides initialisation, all data set implementations have the following methods to read or write records.

read_record() (sequential access data sets)
read_record(record_number) (direct random access data sets)
For reading one single record. For sequential data sets, the record with the next_record_num number is returned (and the value of next_record_num increased by one). If no more records are available, None is returned.
For direct random access a record number must be given to this method and the record with the corresponding _rec_num_ is returned. If the record with the given number is not in the data set, None is returned and a warning message is triggered.
read_records(first_record, number_records) (sequential access data sets)
read_records(record_number_list) (direct random access data sets)
For sequential access data sets, a block of records can be read using this method, starting from the record with number first_record to (including) first_record+number_records-1. After a block of records is read successfully, the value of next_record_num is set to the number of the record following the block. Records are returned in a list.
For direct random access data sets, a list of record numbers must be given to this method, and the corresponding records are returned in a list. The list of returned records can be shorter than the list of record numbers if a record in the list is not in the data set and thus can not be read.
write_record(record)
Write one record into a data set. For sequential access data sets, the record is written to the end of the data set (so the hidden field _rec_num_ is not used), while for direct random access data sets the record is written with the value of _rec_num_ as identifier (if there was already a record with this number in the data set it is overwritten). For sequential data sets, the value of num_records is increased by one in any case, while for direct random access data sets it is only increased if previously no record was stored in the data set with the given record number.
write_records(record_list)
Write the list of records into the data set. For sequential access data sets, the records are written to the end of the data set, and the value of num_records is increased by the number of records written, while for direct random access data sets the hidden fields _rec_num_ are used to write records into the data set. The value of num_records with direct random access data sets is only increased by the number of records with new record numbers.
finalise()
With this routine the access to a data set can be finalised, i.e. depending on the implementation a connection to the database is closed or an underlying file is closed. After a call to this routine no data set access is possible, but the data set can be re-initialised with a new (or the same) access mode. Note that for a memory based data set all data is lost when it is finalised.
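
The behaviour of these methods for a direct random access data set can be sketched with a plain Python dictionary. The class MemorySketch below is a simplified illustration and not the Febrl implementation; it ignores access modes and omits the warning message for missing records.

```python
class MemorySketch(object):
    """Toy model of an in-memory direct random access data set."""

    def __init__(self, name):
        self.name = name
        self._store = {}      # Maps record numbers to record dictionaries
        self.num_records = 0

    def write_record(self, record):
        rec_num = record['_rec_num_']  # Hidden field used as identifier
        if rec_num not in self._store:
            self.num_records += 1      # Only new record numbers are counted
        self._store[rec_num] = record  # An existing record is overwritten

    def write_records(self, record_list):
        for record in record_list:
            self.write_record(record)

    def read_record(self, record_number):
        # Returns None if no record with this number is stored
        return self._store.get(record_number)

    def read_records(self, record_number_list):
        # Numbers that are not in the data set are skipped
        return [self._store[num] for num in record_number_list
                if num in self._store]

    def finalise(self):
        # For a memory based data set all data is lost
        self._store = {}
        self.num_records = 0


mem = MemorySketch('tmp')
mem.write_records([{'_rec_num_': 0, 'surname': 'miller'},
                   {'_rec_num_': 7, 'surname': 'smith'}])
mem.write_record({'_rec_num_': 7, 'surname': 'smyth'})  # Overwrites record 7
found = mem.read_records([0, 3, 7])  # Record 3 does not exist
```

After these calls num_records is 2, and the list found contains only the two records with numbers 0 and 7.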

Direct random access data sets have another method that allows them to be re-initialised.

re_initialise(access_mode) (direct random access data sets)
This method can be used to finalise and re-initialise a random access data set. This is needed for parallel runs of Febrl where, for example, only the host process writes standardised records into a temporary data set in the cleaning and standardisation phase (see note below), but all processes need to be able to read these cleaned records in the linkage or deduplication phase. A re-initialisation can thus be used to synchronise the random access data sets for all processes (which for a memory based data set means it has to be copied to all processes).
This method can be called with or without a new access mode. If no access_mode argument is given, the data set will be initialised in the same access mode. Alternatively a new access mode can be given.
Depending on the setting for parallel write access (see note below), random access data sets should be re-initialised in read access mode so that all processes are able to read all the records in such a data set.

Note: When Febrl is run in parallel, two different forms of write access to data sets are possible. Either only the host process (the process on which Febrl was started) writes into data sets, or all processes write into data sets. When initialising a Febrl project object (as described in Chapter 5), it is possible to configure this parallel write access mode by setting the argument parallel_write to either 'host' (the default) or 'all'. In the case where all processes are writing, each of them writes into its own (local) data sets. This is done by adding the process number (for example P0 for the host, P1, P2, etc. for the other processes) to the file names. For the SQL data set only the 'host' write access mode is currently supported.
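
The idea of per-process file names can be sketched as follows. The helper local_file_name is hypothetical, and the exact way Febrl attaches the process number to a file name is not specified here, so appending it as a suffix is purely an assumption made for illustration.

```python
def local_file_name(file_name, process_number, parallel_write='host'):
    """Return the file name a process would write to.

    Assumption: in 'all' mode the process number is appended as a suffix
    such as '.P0'. The real naming scheme in Febrl may differ.
    """
    if parallel_write not in ('host', 'all'):
        raise ValueError('Illegal parallel_write value: %r' % (parallel_write,))
    if parallel_write == 'all':
        return '%s.P%d' % (file_name, process_number)
    return file_name  # In 'host' mode only the host process writes


names = [local_file_name('tmpdata.slv', p, 'all') for p in range(3)]
```

With three processes this gives one distinct local file name per process, so that no two processes write into the same file.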

In the following sections details on the data set implementations and their specific attributes are given.

Footnotes

13.1 MySQL: see http://www.mysql.com