13.5 Memory Data Set Implementation

The memory data set uses Python dictionaries, which are essentially hash tables, to store records temporarily in main memory. This data set allows direct random access. It provides efficient temporary storage for records that do not need to be made persistent, e.g. cleaned and standardised records before they are linked with another data set.
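The idea of dictionary-based storage can be illustrated with a minimal sketch. The class TinyMemoryStore and its method bodies are hypothetical and written for illustration only; they are not the actual DataSetMemory code, and only the use of a dictionary keyed by record number is taken from the description above.

```python
# Hypothetical sketch, NOT the actual DataSetMemory implementation.
class TinyMemoryStore(object):
    """Keep records in a dictionary keyed by their record number."""

    def __init__(self):
        self.records = {}  # Maps record number -> record dictionary

    def write_record(self, record):
        # Store a copy so later changes made by the caller do not leak in
        self.records[record['_rec_num_']] = dict(record)

    def read_record(self, rec_num):
        # Direct random access: a single hash lookup, no scanning
        return self.records.get(rec_num)

    def num_records(self):
        return len(self.records)

    def finalise(self):
        # Nothing is saved anywhere, so the records are simply dropped
        self.records.clear()


store = TinyMemoryStore()
store.write_record({'surname': 'miller', '_rec_num_': 7})
print(store.read_record(7)['surname'])  # miller
print(store.read_record(8))             # None (no such record)
```

Because each lookup is a hash table access, reading a record does not depend on the order in which records were written.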

The only possible access mode for this data set implementation is 'readwrite', as a 'read' only memory data set could never be filled with records and a 'write' only memory data set could never be read back.

The fields attribute of a memory data set must be a dictionary where the keys are the field names. The corresponding values are not used and can thus be anything, e.g. an empty string or an integer number. They are not needed to access records or the fields within records.
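Since only the keys of the fields dictionary matter, such a dictionary can be built from a plain list of field names. The following sketch uses the standard dict.fromkeys() method; the check for unknown field names at the end is a hypothetical illustration of how the keys might be used and is not taken from the data set implementation.

```python
field_names = ['year', 'surname', 'givenname', 'dob',
               'address', 'postcode', 'state']

# Every field name becomes a key, the value '' is just a placeholder
fields = dict.fromkeys(field_names, '')

record = {'surname': 'miller', 'state': 'act', '_rec_num_': 0}

# Hypothetical check: which record fields are not defined in 'fields'?
# (The hidden record number field is skipped.)
unknown = [name for name in record
           if name not in fields and name != '_rec_num_']

print(unknown)  # []
```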

Note that all records are lost when a memory data set is finalised or a program finishes. Use other data sets for persistent storage.

Besides the general data set attributes described above, no additional attribute (such as a file name) is needed.

The following example shows how to initialise a memory data set and how to access it in read/write mode. It is assumed that the dataset.py module has been imported using the import dataset command.

# ====================================================================

mydata = dataset.DataSetMemory(name = 'hospital-tmp-data',
                        description = 'Hospital data from 1990-2000',
                        access_mode = 'readwrite',
                             fields = {'year':'',
                                       'surname':'',
                                       'givenname':'',
                                       'dob':'',
                                       'address':'',
                                       'postcode':'',
                                       'state':''},
                     fields_default = '',
                       strip_fields = True,
                     missing_values = ['','missing'])

print(mydata.num_records)  # Print total number of records (0)

first_record = {'surname':'miller','givenname':'peter','state':'act',
                '_rec_num_':0}

mydata.write_record(first_record)

more_records = [{'surname':'smith','givenname':'dave','dob':'1966',
                 '_rec_num_':1},
                {'surname':'winkler','givenname':'harry',
                 '_rec_num_':42},
                {'surname':'paul','postcode':'2100','state':'nsw',
                 '_rec_num_':0}]

mydata.write_records(more_records)

print(mydata.num_records)  # Print total number of records (3)

record = mydata.read_record(42)

record_list = mydata.read_records([0,1])

mydata.finalise()  # Finalise access to data set, all records are lost

Note that record numbers (the values in the hidden field _rec_num_) do not need to form a consecutive range, as shown in the example above. Also, a record can be overwritten at any time by writing a new record with an already existing record number to the data set.
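The record count of 3 in the example can be traced with a plain Python dictionary keyed by record number. This is only a sketch of the behaviour described here, assuming that a record written with an existing number replaces the earlier record as a whole; the variable names are hypothetical and the memory data set itself is not used.

```python
records = {}  # Maps record number -> record dictionary

incoming = [
    {'surname': 'miller', 'givenname': 'peter', 'state': 'act',
     '_rec_num_': 0},
    {'surname': 'smith', 'givenname': 'dave', 'dob': '1966',
     '_rec_num_': 1},
    {'surname': 'winkler', 'givenname': 'harry',
     '_rec_num_': 42},
    {'surname': 'paul', 'postcode': '2100', 'state': 'nsw',
     '_rec_num_': 0},  # Same record number as the first record
]

for rec in incoming:
    # A repeated record number overwrites the earlier entry
    records[rec['_rec_num_']] = rec

print(len(records))           # 3 (four writes, one of them an overwrite)
print(sorted(records))        # [0, 1, 42]
print(records[0]['surname'])  # paul
```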