13. Data Set Access

Data sets can be available in many different formats. They are often stored as portable text files or in databases, but sometimes also in efficient non-portable binary files or in application-specific file formats. Within Febrl, the module dataset.py provides the means to access various data set formats. The module has been written to make it easy to add methods for reading new data formats in the future.

Generally, one can distinguish between data sets that only allow sequential record access, and data sets that allow direct random access to any stored record (using a unique record identifier). While sequential data sets are mainly used for persistent storage of data (as in text files or databases), direct random access data sets are very useful for temporary storage, for example after records have been cleaned and standardised and before they are used in a deduplication or linkage process.
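The difference between the two kinds of access can be illustrated with a minimal sketch that does not use Febrl itself. The helper iter_records and the sample data are hypothetical and only serve to show the idea: a sequential source is consumed in order, while a direct access store is keyed by a unique integer record number.

```python
import io


def iter_records(text_file):
    """Sequential access: yield one record after another, in file order."""
    for line in text_file:
        yield line.rstrip("\n").split(",")


# Hypothetical sample data standing in for a text file on disk.
raw = io.StringIO("peter,miller\nanna,smith\njohn,doe\n")

# Direct random access: records are stored under a unique integer
# identifier (starting with zero), so any record can be fetched directly.
store = {}
for rec_num, values in enumerate(iter_records(raw)):
    store[rec_num] = values

third = store[2]  # No need to scan the first two records again.
```

In this sketch a plain dictionary plays the role of the temporary direct access data set; a persistent variant could use Python's shelve module in the same way.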

The Febrl system currently supports the following five data set implementations: COL, CSV and SQL (sequential access), and shelve and memory (direct random access).

Note: Another data set implementation, named PQSQL, has been made available by an external contributor for accessing the open source database PostgreSQL. Its interface is very similar to the SQL data set implementation. For more details please see the module dataset.py.

Records within a data set in Febrl are internally identified by unique integer numbers (starting with zero). These numbers are not stored in data sets with sequential access, but are used in data sets with direct random access to identify records.

Data sets in Febrl need to be initialised. This merely means that the Febrl program does whatever is necessary to prepare for reading records from, or writing records to, the data set. A data set can be initialised in one of several access modes, and all further access is restricted to this access mode (for example, a data set initialised for read access cannot be written to). This allows one to have input data sets with read access only, temporary data sets with read and write access, and output data sets with write or append access restrictions.
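How an access mode fixed at initialisation restricts all later access can be sketched as follows. The class ToyDataSet, its method names and the mode strings used here are assumptions made for illustration only; they are not the classes or values defined in dataset.py.

```python
class ToyDataSet:
    """Hypothetical stand-in for a data set with a fixed access mode."""

    # Assumed mode names, chosen for this sketch only.
    MODES = ("read", "write", "append", "readwrite")

    def __init__(self, name, access_mode):
        if access_mode not in self.MODES:
            raise ValueError("unknown access mode: %r" % (access_mode,))
        self.name = name
        self.access_mode = access_mode
        self._records = []

    def write_record(self, record):
        # A data set initialised for read access cannot be written to.
        if self.access_mode == "read":
            raise IOError("data set %r is read only" % self.name)
        self._records.append(dict(record))

    def read_records(self):
        # Output data sets (write or append) cannot be read from.
        if self.access_mode in ("write", "append"):
            raise IOError("data set %r is write only" % self.name)
        return [dict(rec) for rec in self._records]


temp_data = ToyDataSet("temp", "readwrite")
temp_data.write_record({"surname": "miller"})
```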

The following access modes are supported and can be set using the argument access_mode when a data set is initialised.

Each record in a data set comprises one or more fields. A Python dictionary with the field names as keys needs to be given when a data set is initialised, as can be seen in the examples in the following subsections. The values in this dictionary depend on the data set type (e.g. for a CSV data set the column numbers need to be given, while for an SQL data set the database column names are needed). For both the shelve and memory data sets only the field names are used, so the values in the fields dictionary can be anything.

It is not necessary to define all fields in a data set. Only the ones that are used in a data cleaning and standardisation or in a record linkage process need to be listed. For example, a data set can contain names and addresses of patients, plus many medical attributes which are not used in a record linkage project (see the various examples throughout this chapter), in which case only the used name and address fields need to be defined in the fields dictionary.
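As a rough illustration of a fields dictionary for a CSV file, the following sketch maps field names to column numbers and lists only the fields that are needed. It uses Python's standard csv module rather than Febrl, and the field names, column layout and sample row are hypothetical.

```python
import csv
import io

# Only the fields used for linkage are defined. Columns 2 and 3 hold
# medical attributes that are not needed, so they are simply left out.
fields = {"given_name": 0, "surname": 1, "postcode": 4}

# Hypothetical sample data standing in for a CSV file on disk.
data = io.StringIO("peter,miller,A+,120/80,2600\n")

records = []
for row in csv.reader(data):
    # Build one record dictionary per row from the listed columns only.
    records.append({name: row[col] for name, col in fields.items()})
```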

When reading from a data set, records are returned as Python dictionaries, i.e. key:value pairs, where the keys are the field names and the values the corresponding values read from the data set. Note that only non-empty fields are returned, thus such a record dictionary can be quite small even if a data set has many fields.
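The effect of returning only non-empty fields can be sketched with a small helper. The function row_to_record is hypothetical and is not part of dataset.py; it only mimics the described behaviour for a row of column values.

```python
def row_to_record(row, fields):
    """Return a record dictionary containing only the non-empty fields."""
    record = {}
    for name, col in fields.items():
        # Treat missing columns and whitespace-only values as empty.
        value = row[col].strip() if col < len(row) else ""
        if value:
            record[name] = value
    return record


fields = {"given_name": 0, "surname": 1, "postcode": 2}
record = row_to_record(["peter", "", "2600"], fields)
```

Here the empty surname column does not appear in the resulting dictionary at all, which is why record dictionaries stay small even for data sets with many fields.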

Similarly, when writing records into a data set, they also have to be given as dictionaries with the field names of the data set as keys. For fields that are missing from a record dictionary, the fields_default value will be written into the data set (for sequential access data sets only).
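The role of a default value when writing can be sketched in the same spirit. The helper record_to_row below is hypothetical; it assumes a fields dictionary that maps field names to column numbers and fills every column without a value with the given default.

```python
def record_to_row(record, fields, fields_default=""):
    """Turn a record dictionary into a list of column values."""
    # Start with a row in which every column holds the default value.
    row = [fields_default] * (max(fields.values()) + 1)
    for name, col in fields.items():
        # Fields missing from the record keep the default value.
        row[col] = record.get(name, fields_default)
    return row


fields = {"given_name": 0, "surname": 1, "postcode": 2}
row = record_to_row({"given_name": "peter", "postcode": "2600"}, fields,
                    fields_default="missing")
```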

When reading from a data set, for each record two hidden fields are returned, namely _rec_num_ (the record identifier number) and _dataset_name_ (the name of the data set the record is read from). Using these hidden fields a record can be uniquely identified. Note that for direct access data sets the hidden field _rec_num_ is used as a record identifier when records are written into such a data set.
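The following sketch shows how the two hidden fields can serve as a unique record key across several data sets. Only the field names _rec_num_ and _dataset_name_ come from the text above; the helper functions and the data set names are hypothetical.

```python
def add_hidden_fields(record, rec_num, dataset_name):
    """Return a copy of the record with the two hidden fields attached."""
    tagged = dict(record)
    tagged["_rec_num_"] = rec_num
    tagged["_dataset_name_"] = dataset_name
    return tagged


def record_key(record):
    """Combine both hidden fields into one key that identifies the record."""
    return (record["_dataset_name_"], record["_rec_num_"])


rec_a = add_hidden_fields({"surname": "miller"}, 0, "hospital_a")
rec_b = add_hidden_fields({"surname": "miller"}, 0, "hospital_b")
```

Both records carry the record number 0, but their keys differ because the data set name is part of the key.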

All data set implementations have the following attributes, which can be set with the corresponding key word (argument) when a data set is initialised.

Sequential data sets (i.e. COL, CSV and SQL) have the following additional attribute that can be set when a data set is initialised.

Two attributes that should be read only (one of them for sequential access data sets only) are

Besides initialisation, all data set implementations have the following methods to read or write records.

Direct random access data sets have another method that allows them to be re-initialised.

Note: When Febrl is run in parallel, two forms of write access to data sets are possible: either only the host process (the one on which Febrl was started) writes into data sets, or all processes write into data sets. When initialising a Febrl project object (as described in Chapter 5), this parallel write access mode can be configured by setting the argument parallel_write to either 'host' (the default) or 'all'. In the case where all processes are writing, each of them writes into its own (local) data sets. This is done by adding the process number (for example P0 for the host, P1, P2, etc. for the other processes) to the file names. For the SQL data set only the 'host' write access mode is currently supported.
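The file naming in 'all' mode can be sketched as below. The function local_file_name is hypothetical, and the exact position at which Febrl adds the process number to a file name is not stated here; placing it before the file extension is an assumption made for this sketch.

```python
import os


def local_file_name(file_name, process_num, parallel_write="host"):
    """Return the file name a given process would write to."""
    if parallel_write == "host":
        # Only the host process writes, so the name stays unchanged.
        return file_name
    if parallel_write != "all":
        raise ValueError("parallel_write must be 'host' or 'all'")
    # Assumed placement: insert P<number> before the file extension.
    base, ext = os.path.splitext(file_name)
    return "%s.P%d%s" % (base, process_num, ext)


host_name = local_file_name("output.csv", 0, parallel_write="all")
worker_name = local_file_name("output.csv", 2, parallel_write="all")
```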

In the following sections details on the data set implementations and their specific attributes are given.


 see http://www.mysql.com