12.4 Shelve Data Set Implementation

The shelve data set uses the Python standard module shelve.py, which provides a file-based hash table (dictionary) that allows efficient storage of and access to arbitrary records. A shelve data set is therefore an efficient and convenient data set implementation for the temporary persistent storage of records. This data set implementation supports direct random access only.
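As background, the standard shelve module itself behaves like a persistent dictionary whose keys are strings and whose values can be arbitrary (picklable) Python objects. The following minimal sketch is independent of Febrl and only illustrates this underlying mechanism (the file name 'example_shelve' is arbitrary).

# ====================================================================

import shelve

db = shelve.open('example_shelve')   # Open (or create) a file-based dictionary

db['0'] = {'surname':'miller', 'givenname':'peter'}  # Keys must be strings

record = db['0']   # Random access to a stored value by its key

db.close()         # Write everything back to disk and close the file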

Note: The Python shelve module is based on an underlying database library such as dbm, gdbm or bsddb. Unfortunately, these database libraries seem to be badly broken when used within Python 2.2 (for example, they crash when trying to load several thousand records into a shelve). Starting with Febrl version 0.2.1 we therefore support the use of the external module bsddb3, which can be downloaded from http://pybsddb.sourceforge.net. The Berkeley database library itself is available under an open source software license from Sleepycat Software at http://www.sleepycat.com. The shelve data set implementation automatically detects whether the bsddb3 module is installed and will use it if it is available.
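Detecting an optional module is typically done with a guarded import. The following sketch only illustrates this general pattern and is not the actual Febrl code (the flag name use_bsddb3 is made up for this example).

# ====================================================================

try:
  import bsddb3          # External Berkeley DB wrapper module
  use_bsddb3 = True      # Use bsddb3 based shelves if possible
except ImportError:
  use_bsddb3 = False     # Fall back to the standard shelve module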

The fields attribute of a shelve data set must be a dictionary where the keys are the field names. The corresponding values are not used and can thus be anything, e.g. an empty string or an integer number. They are not needed to access records or the fields within records.

Two additional attributes (besides the general data set attributes described above) are needed for a shelve data set: file_name, the name of the file in which the shelve is stored on disk, and clear, a flag set to True or False that specifies whether any records already stored in the shelve file are removed when the data set is initialised. Both attributes are shown in the example below.

The following example shows how to initialise a shelve data set and how to access it in read/write mode. It is assumed that the dataset.py module has been imported using the import dataset statement.

# ====================================================================

mydata = dataset.DataSetShelve(name = 'hospital-data',
                        description = 'Hospital data from 1990-2000',
                       access_right = 'readwrite',
                          file_name = 'hospital',
                              clear = True,
                             fields = {'year':'',
                                       'surname':'',
                                       'givenname':'',
                                       'dob':'',
                                       'address':'',
                                       'postcode':'',
                                       'state':''},
                     fields_default = '',
                     missing_values = ['','missing'])

print mydata.num_records  # Print total number of records

first_record = {'surname':'miller','givenname':'peter','state':'act',
                '_rec_num_':0}

mydata.write_record(first_record)

more_records = [{'surname':'smith','givenname':'dave','dob':'1966',
                 '_rec_num_':1},
                {'surname':'winkler','givenname':'harry',
                 '_rec_num_':42},
                {'surname':'paul','postcode':'2100','state':'nsw',
                 '_rec_num_':0}]

mydata.write_records(more_records)

print mydata.num_records  # Print total number of records (3)

record = mydata.read_record(42)

record_list = mydata.read_records([0,1])

mydata.re_initialise()  # Re-initialise data set in readwrite mode

record = mydata.read_record(1)

mydata.write_record(first_record)
mydata.re_initialise('read')  # Re-initialise data set in read mode

record_list2 = mydata.read_records([42,0,1])

mydata.finalise()  # Close file, finalise access to data set

Note that record numbers (the values in the hidden field _rec_num_) do not need to be in a consecutive range, as shown in the example above. Also, a record can be overwritten at any time by writing a new record whose record number value already exists in the data set.
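For example, continuing the example above (and assuming the data set mydata is again available in 'readwrite' access mode), writing a record that carries an already used record number replaces the stored record while the total number of records stays the same. The record contents below are made up for illustration.

# ====================================================================

updated_record = {'surname':'miller', 'givenname':'petra', 'state':'act',
                  '_rec_num_':0}

mydata.write_record(updated_record)  # Overwrites the existing record 0

print mydata.num_records  # Total number of records is unchanged (3)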