6.6 Date Cleaning and Standardisation

The cleaning and standardisation of date components is undertaken using user-configurable rules. Examples of dates in administrative data sets are date of birth, injury dates, hospital admission dates etc. Dates of birth are often recorded with high accuracy, and they can be checked to some degree if an age field is also available. The routines related to date standardisation are implemented in the date.py module.

The aim of date standardisation is to split a given date string into a valid numerical triplet [day,month,year]. The date parsing routine consists of an initial cleaning phase, where leading and trailing whitespaces are removed from the input string and various separator strings are replaced by one whitespace.

Date parsing is done using date format strings. A format string must consist of three format directives, one each for the day, month and year component. The following directives are supported:

%b: For abbreviated month names (Jan, Feb, Mar, etc.)
%B: For full month names (January, February, etc.)
%d: For day of the month as a decimal number between 1 and maximal 31 (depending on the month, and for February if a year is a leap year or not).
%m: For month as a decimal number between 1 and 12.
%y: For year without century as a decimal number, i.e. between 00 and 99 (two digits).
%Y: For year with century as a decimal number (four digits).
%U: For unknown days. If days are unknown a value of 'unk' or 'unkown' can be parsed.

While a format string '%d %m %Y' matches all dates that start with a day number, followed by a month number and then a four digit year number, a format string '%b %d %y' matches for example 'Aug 9 02'. Date format strings are possible with and without space separators between the directives. If no spaces are given the %b and %B directives are not possible (only numerical variations of day, month and year are allowed without spaces between them). A list of date parsing format strings needs to be defined in the project.py module as shown in the example in Chapter 5.

The date parsing routine takes the first format string and tries to parse a given date input string using this format. If it fails, the second format string is tried, and this process is repeated until the given date input string could be parsed, or no more format string are available in the list, in which case an error is returned. Thus, the order in which date format strings are listed is important. The user should order this list according to the format in which dates will most likely be represented in the input data sets.

If only a two-digit year has been parsed, it is expanded into a four-digit year using a pivot year that is an argument to a date standardiser as shown below. The pivot year separates the range of two-digit years 00-99 into two parts, one that is expanded with 19xx (year numbers equal to and after the pivot year), the other with 20xx (year numbers smaller than the pivot year). For example, if the pivot year is set to 04, a two-digit year value of 68 is expanded into 1968, a year value of 04 is expanded into 1904 but a year value of 03 is expanded into 2003.

When a date standardiser is initialised, the following arguments need to be given.

name
A name for the date standardiser. Most suitable is a short string.
description
A longer description of the date standardiser. Note that this argument is not mandatory, the standardisation process works fine without a description.
input_fields
A string with a field name or a list of field names from the input data set, that will be standardised. If a list is given the field values will be concatenated into one string before the parsing and standardisation is done.
output_fields
A list of three field names as defined in the output data set. The date component standardiser returns standardised dates in these three fields, with the day value in the first field, the month value in the second and the year value in the third field. It is possible to set fields to None if no output is to be written to a field (for example if one is not interested in the day values), as long as at least one output field is defined (not None).
parse_formats
A string with a date parse format or a list with date parsing format strings as described above.
pivot_year
Value of the pivot year (between 00 and 99) that controls expansion of two-digit year values into four-digit year values as described above. A two-digits year value xx smaller than the pivot year will be expanded into 20xx, and year values larger and equal than the pivot year will be expanded into 19xx. The default value is the current year plus one.

The following example code shows how a list of date parsing format strings and two different date standardisers are initialised. They have different input fields and the parsed dates are standardised into different output dates. Note that for the mother standardiser only the year output field is used, both the day and month values are set to None and thus discarded.

# ====================================================================

date_parse_formats = ['%d %m %Y',   # 24 04 2002  or  24 4 2002
                      '%d %B %Y',   # 24 Apr 2002 or  24 April 2002
                      '%m %d %Y',   # 04 24 2002  or  4 24 2002
                      '%B %d %Y',   # Apr 24 2002 or  April 24 2002
                      '%Y %m %d',   # 2002 04 24  or  2002 4 24
                      '%Y %B %d',   # 2002 Apr 24 or  2002 April 24
                      '%Y%m%d',     # 20020424  ** ISO standard **
                      '%d%m%Y',     # 24042002
                     ]

baby_std = DateStandardiser(name = 'Baby-dob-std',
                     description = 'Baby date of birth standardiser',
                    input_fields = 'bdob',
                   output_fields = ['baby_day','baby_month',
                                    'baby_year'],
                   parse_formats = date_parse_formats)

mother_std = DateStandardiser(name = 'Mother-dob-std',
                       description = 'Mother year of birth stand.',
                      input_fields = 'mdob',
                     output_fields = [None, None, 'mother_year'],
                     parse_formats = date_parse_formats,
                        pivot_year = 04)