13.2 CSV Data Set Implementation

Text files with comma-separated values are common, as they are a portable way to store data exported from spreadsheets or database tables. Such files often have the file extension '.csv'. This data set implementation allows sequential access only.

The fields attribute of a CSV data set must be a dictionary where the keys are the field names and the values are the corresponding column numbers (starting with 0).
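
As a rough illustration of how such a dictionary maps field names to columns, the following sketch uses Python's standard csv module rather than Febrl itself. The helper name extract_fields, its default argument, and the sample line are hypothetical and only serve the example; this is not how Febrl implements the mapping internally.

```python
import csv
import io

def extract_fields(row, fields, default=''):
    """Pick the named columns out of one parsed CSV row.

    Hypothetical helper, not part of Febrl: 'fields' maps field names
    to column numbers (starting with 0). Columns that are missing from
    a short row are filled with 'default'.
    """
    record = {}
    for name, column in fields.items():
        record[name] = row[column] if column < len(row) else default
    return record

# Made-up sample line, for illustration only
sample = io.StringIO('1995,miller,peter,2602\n')

# Column 2 is deliberately not mapped, and column 4 does not exist
fields = {'year': 0, 'surname': 1, 'postcode': 3, 'state': 4}

row = next(csv.reader(sample))
record = extract_fields(row, fields)
```

Note that the dictionary does not have to cover every column in the file, and the column numbers do not have to be listed in order.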

Additional attributes for a CSV data set (besides the general data set attributes described above) are:

file_name
A string containing the name of the underlying CSV text file.
header_lines
The number of header lines at the beginning of the CSV file that have to be skipped in read access mode. The default value is 0, which corresponds to no header line.
write_header
A flag that can be set to True or False. If set to True, a header line with the field names is written at the beginning of the CSV file when the data set is initialised in write mode, or in append mode if the file is empty. The default value is False, i.e. no header line will be written.
write_quote_char
The character used to quote each field when records are written to the CSV file. The default value is an empty string, which means the fields are not quoted. A common quote character is " (double quote).
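
The behaviour described by header_lines, write_header and write_quote_char can be sketched with plain Python file handling. The functions format_line, append_records and read_lines below are hypothetical stand-ins, not Febrl code, and the sketch assumes that the header line is quoted in the same way as the data lines, which the description above does not state.

```python
import os
import tempfile

def format_line(values, quote_char=''):
    """Join values with commas, wrapping each one in quote_char.

    Simplified on purpose: commas or quote characters inside a value
    are not escaped.
    """
    return ','.join(quote_char + v + quote_char for v in values) + '\n'

def append_records(file_name, field_names, records,
                   write_header=False, write_quote_char=''):
    """Append records; write a header only if the file is still empty."""
    is_empty = (not os.path.exists(file_name)
                or os.path.getsize(file_name) == 0)
    with open(file_name, 'a') as csv_file:
        if write_header and is_empty:
            csv_file.write(format_line(field_names, write_quote_char))
        for record in records:
            csv_file.write(format_line(record, write_quote_char))

def read_lines(file_name, header_lines=0):
    """Return all lines after skipping the given number of header lines."""
    with open(file_name) as csv_file:
        lines = csv_file.read().splitlines()
    return lines[header_lines:]

# Temporary file and made-up records, for illustration only
demo_file = os.path.join(tempfile.mkdtemp(), 'demo.csv')
names = ['year', 'surname']

# First call: the file is empty, so the header line is written
append_records(demo_file, names, [['1995', 'miller']],
               write_header=True, write_quote_char='"')
# Second call: the file already has content, so no second header
append_records(demo_file, names, [['1996', 'smith']],
               write_header=True, write_quote_char='"')

all_lines = read_lines(demo_file)                   # header included
data_lines = read_lines(demo_file, header_lines=1)  # header skipped
```

In this sketch the header appears only once even though write_header is True in both calls, which mirrors the "if the file is empty" condition for append mode.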

The following example shows how to initialise a CSV data set and how to access it in read mode. It is assumed that the dataset.py module has been imported using the import dataset command.

# ====================================================================

mydata = dataset.DataSetCSV(name = 'hospital-data',
                     description = 'Hospital data from 1990-2000',
                    access_right = 'read',
                    header_lines = 1,
                       file_name = './data/hospital.csv',
                          fields = {'year':0,
                                    'surname':1,
                                    'givenname':2,
                                    'dob':12,
                                    'address':7,
                                    'postcode':8,
                                    'state':9},
                  fields_default = '',
                    strip_fields = True,
                  missing_values = ['','missing'])

print(mydata.num_records)  # Print total number of records

first_record = mydata.read_record()  # Returns one record

hundred_records = mydata.read_records(1000,100)  # Read 100 records
ten_records = mydata.read_records(2000,10)  # Read another 10 records

mydata.finalise()  # Close file, finalise access to data set

Note: In its current implementation, a CSV data set can only consist of one underlying CSV text file. The handling of multiple files as one data set will be implemented in a future version of Febrl.