Data sets can be available in many different formats, often stored either as portable text files or in databases, but sometimes also in efficient non-portable binary files or in application-specific file formats. Within Febrl, the module dataset.py provides the means to access various data set formats. The module has been written to make it easy to add methods for reading new data formats in the future.
Generally, one can distinguish between data sets that only allow sequential record access and data sets that allow direct random access to any stored record (using a unique record identifier). While sequential data sets are mainly used for persistent storage (such as text files or databases), direct random access data sets are very useful for temporary storage, for example after records have been cleaned and standardised and before they are used in a deduplication or linkage process.
The Febrl system currently supports the following five data set implementations.
COL
Column-wise text files (sequential access).
CSV
Comma separated values text files (sequential access).
SQL
Data sets stored in SQL databases (sequential access).
Shelve
Direct random access data sets stored persistently using Python shelves.
Memory
Direct random access data sets held in memory.
Records within a data set in Febrl are internally identified by unique integer numbers (starting with zero). These numbers are not stored in data sets with sequential access, but are used in data sets with direct random access to identify records.
Data sets in Febrl need to be initialised, which simply means that Febrl does whatever is necessary to prepare for reading records from or writing records to the data set. A data set can be initialised in one of several access modes, and all further access is restricted to this access mode (for example, a data set initialised for read access cannot be written to). This allows one to have input data sets with read access only, temporary data sets with read and write access, and output data sets with write or append access restrictions.
The following access modes are supported and can be set using the access_mode argument when a data set is initialised.
read
write
append
readwrite
Each record in a data set comprises one or more fields. A
Python dictionary with the field names as keys needs to be given when
a data set is initialised, as can be seen in the examples in the
following subsections. The values in this dictionary depend on the
data set type (e.g. for a CSV data set the column numbers need to be
given, while for an SQL data set the database column names are
needed). For both the shelve and memory data sets only the field names
are used, so the values in the fields dictionary can be anything.
It is not necessary to define all fields in a data set; only the ones that are used in a data cleaning and standardisation process or in a record linkage process need to be listed. For example, a data set can contain the names and addresses of patients plus many medical attributes that are not used in a record linkage project (see the various examples throughout this chapter), in which case only the name and address fields that are used need to be defined in the fields dictionary.
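As an illustration, a fields dictionary for a small CSV data set might look as follows (the field names and column numbers here are invented for this sketch); for a shelve or memory data set the same field names could be given with arbitrary values.

  # Fields dictionary for a CSV data set: the values are column numbers
  # (field names and column numbers are invented for illustration).
  csv_fields = {'given_name': 1,
                'surname':    2,
                'suburb':     3,
                'postcode':   4}

  # For a shelve or memory data set only the keys matter, so the values
  # can be anything.
  memory_fields = {'given_name': '',
                   'surname':    '',
                   'suburb':     '',
                   'postcode':   ''}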
When reading from a data set, records are returned as Python dictionaries, i.e. key:value pairs, where the keys are the field names and the values are the corresponding field values read from the data set. Note that only non-empty fields are returned, so such a record dictionary can be quite small even if a data set has many fields. Similarly, when writing records into a data set, they also have to be given as dictionaries with the field names of the data set as keys. For fields that are not present in a record dictionary, the fields_default value will be written into the data set (sequential access data sets only).
When reading from a data set, for each record two hidden fields
are returned, namely _rec_num_
(the record identifier
number) and _dataset_name_
(the name of the data set the
record is read from). Using these hidden fields a record can be
uniquely identified. Note that for direct access data sets the hidden
field _rec_num_
is used as a record identifier when records
are written into such a data set.
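For example, a record read from a hypothetical data set named 'example_csv' (all field names and values below are invented for this sketch) could be returned as the following dictionary; note that empty fields simply do not appear.

  # A record as returned when reading from a data set (invented values).
  # Empty fields are not included in the dictionary.
  record = {'given_name':     'peter',
            'surname':        'miller',
            'suburb':         'dickson',
            '_rec_num_':      42,             # unique record identifier number
            '_dataset_name_': 'example_csv'}  # name of the data set the record was read from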
All data set implementations have the following attributes, which can be set with the corresponding keyword argument when a data set is initialised. A short initialisation sketch is given after the list.
name
The name of the data set.
description
A description of the data set.
access_mode
One of 'read', 'write', 'append' (sequential access data sets only) or 'readwrite' (direct random access data sets only), as explained above.
fields
The dictionary of field names described above.
fields_default
The default value that is written into the data set for fields not given in a record dictionary (sequential access data sets only).
missing_values
A list of values that are regarded as missing. If a field value is in the missing_values list, the corresponding matching weight is set to a missing value weight (see Section 9.2 for more details).
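For example, a CSV data set could be initialised roughly along the following lines. This is a minimal sketch only: the class name DataSetCSV and the file_name argument are assumptions here (the exact class names and file related attributes are given in the following subsections), and the field names, file name and missing value strings are invented.

  # A minimal sketch, assuming a class DataSetCSV and a 'file_name'
  # attribute (see the following subsections for the exact details);
  # field names, file name and missing value strings are invented.
  from dataset import DataSetCSV

  in_data = DataSetCSV(name = 'example_in',
                       description = 'Example input data set',
                       access_mode = 'read',
                       file_name = 'example.csv',     # assumed attribute
                       fields = {'given_name': 1,
                                 'surname':    2,
                                 'suburb':     3,
                                 'postcode':   4},
                       fields_default = '',
                       missing_values = ['', 'missing'])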
Sequential data sets (i.e. COL, CSV and SQL) have the following additional attribute that can be set when a data set is initialised.
strip_fields
Can be set to True or False. If set to True (the default), all field values are stripped of whitespace before they are written into or read from a data set. If set to False, whitespace is not stripped off field values.
Two attributes that should be treated as read only (one of them available for sequential access data sets only) are the following; both can simply be inspected on an initialised data set object, as sketched below.
num_records
The number of records stored in the data set. When a data set is initialised in write access mode this starts at zero. For all other access modes, the number of stored records in the data set is calculated, which might take a while (depending on the data set implementation).
next_record_num
The number of the next record to be read (sequential access data sets only).
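For example (a sketch that reuses the in_data object assumed above):

  # Sketch, reusing the 'in_data' object from the example above.
  print(in_data.num_records)      # number of records stored in the data set
  print(in_data.next_record_num)  # number of the next record to be read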
Besides initialisation, all data set implementations have the following methods to read or write records; a short usage sketch is given after the list.
read_record() (sequential access data sets)
read_record(record_number) (direct random access data sets)
A single record is read and returned as a dictionary. For sequential access data sets, the record with the current next_record_num number is returned (and the value of next_record_num is increased by one); if no more records are available, None is returned. For direct random access data sets, the record stored with the given record number as its _rec_num_ is returned; if the record with the given number is not in the data set, None is returned and a warning message is triggered.
read_records(first_record, number_records) (sequential access data sets)
read_records(record_number_list) (direct random access data sets)
A block of records is read and returned in a list. For sequential access data sets, the records from record number first_record to (and including) first_record+number_records-1 are read; after a block of records has been read successfully, the value of next_record_num is set to the number of the record following the block. For direct random access data sets, the records with the record numbers given in the list are read.
write_record(record)
The given record is written into the data set. For sequential access data sets the value of its hidden field _rec_num_ is not used, while for direct random access data sets the record is written with the value of _rec_num_ as its identifier (if there was already a record with this number in the data set it is overwritten). For sequential data sets, the value of num_records is increased by one in any case, while for direct random access data sets it is only increased if no record was previously stored in the data set with the given record number.
write_records(record_list)
All records in the given list are written into the data set. For sequential access data sets the value of num_records is increased by the number of records written, while for direct random access data sets the hidden fields _rec_num_ are used to write the records into the data set. The value of num_records for direct random access data sets is only increased by the number of records with new record numbers.
finalise()
Finalises (closes) the data set.
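As a sketch of how these methods fit together, the following loop copies all records from a sequential access input data set into a direct random access data set; the objects in_data and tmp_data are assumed to have been initialised beforehand in 'read' and 'readwrite' access mode, respectively (with tmp_data standing for a shelve or memory data set).

  # Sketch: copy all records from a sequential access data set into a
  # direct random access data set ('in_data' and 'tmp_data' are assumed
  # to be initialised in 'read' and 'readwrite' access mode).
  rec = in_data.read_record()     # record with number next_record_num
  while rec is not None:
    tmp_data.write_record(rec)    # stored under its '_rec_num_' value
    rec = in_data.read_record()   # returns None when no records are left

  in_data.finalise()
  tmp_data.finalise()

  # Records can also be processed in blocks, for example:
  #   block = in_data.read_records(0, 100)   # records 0 to 99 as a list
  #   tmp_data.write_records(block)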
Direct random access data sets have another method that allows them to be re-initialised.
re_initialise(access_mode) (direct random access data sets)
The data set is re-initialised. If no access_mode argument is given, the data set will be initialised in the same access mode as before; alternatively a new access mode can be given. For example, a data set that has been written to can be re-initialised in read access mode so that all processes are able to read all the records in such a data set.
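For example (a sketch, assuming a direct random access data set object tmp_data that was initialised in 'readwrite' access mode):

  # Sketch: switch a direct random access data set to read access only
  # ('tmp_data' is assumed to be a shelve or memory data set).
  tmp_data.re_initialise('read')
  # tmp_data.re_initialise()      # would keep the previous access mode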
parallel_write
When Febrl is run in parallel, this attribute can be set to either 'host' (the default) or 'all'. In the case where all processes are writing, each of them will write into its own (local) data set. This is done by adding the process number (for example P0 for the host, and P1, P2, etc. for the other processes) to the file names. For the SQL data set only the 'host' write access mode is currently supported.
In the following sections details on the data set implementations and their specific attributes are given.