The cleaning and standardisation of date components is undertaken using user-configurable rules. Examples of dates in administrative data sets are date of birth, injury dates, hospital admission dates etc. Dates of birth are often recorded with high accuracy, and they can be checked to some degree if an age field is also available. The routines related to date standardisation are implemented in the date.py module.
The aim of date standardisation is to split a given date string into a
valid numerical triplet [day,month,year]
. The date parsing
routine consists of an initial cleaning phase, where leading and
trailing whitespaces are removed from the input string and various
separator strings are replaced by one whitespace.
Date parsing is done using date format strings. A format string must consist of three format directives, one each for the day, month and year component. The following directives are supported:
%b
%B
%d
%m
%y
%Y
%U
'unk'
or 'unkown'
can be parsed.
'%d %m %Y'
matches all dates that
start with a day number, followed by a month number and then a four
digit year number, a format string '%b %d %y'
matches for
example 'Aug 9 02'
. Date format strings are possible with and
without space separators between the directives. If no spaces are
given the %b
and %B
directives are not possible (only
numerical variations of day, month and year are allowed without spaces
between them). A list of date parsing format strings needs to be
defined in the project.py module as shown in the example in
Chapter 5.
The date parsing routine takes the first format string and tries to parse a given date input string using this format. If it fails, the second format string is tried, and this process is repeated until the given date input string could be parsed, or no more format string are available in the list, in which case an error is returned. Thus, the order in which date format strings are listed is important. The user should order this list according to the format in which dates will most likely be represented in the input data sets.
If only a two-digit year has
been parsed, it is expanded into a four-digit year using a pivot
year that is an argument to a date standardiser as shown below. The
pivot year separates the range of two-digit years 00-99 into
two parts, one that is expanded with 19xx (year numbers equal
to and after the pivot year), the other with 20xx (year numbers
smaller than the pivot year). For example, if the pivot year is set to
04
, a two-digit year value of 68
is expanded into
1968
, a year value of 04
is expanded into 1904
but a year value of 03
is expanded into 2003
.
When a date standardiser is initialised, the following arguments need to be given.
name
description
input_fields
output_fields
None
if no output is to be written to a
field (for example if one is not interested in the day values), as
long as at least one output field is defined (not None
).
parse_formats
pivot_year
The following example code shows how a list of date parsing format
strings and two different date standardisers are initialised. They
have different input fields and the parsed dates are standardised into
different output dates. Note that for the mother standardiser only the
year output field is used, both the day and month values are set to
None
and thus discarded.
# ==================================================================== date_parse_formats = ['%d %m %Y', # 24 04 2002 or 24 4 2002 '%d %B %Y', # 24 Apr 2002 or 24 April 2002 '%m %d %Y', # 04 24 2002 or 4 24 2002 '%B %d %Y', # Apr 24 2002 or April 24 2002 '%Y %m %d', # 2002 04 24 or 2002 4 24 '%Y %B %d', # 2002 Apr 24 or 2002 April 24 '%Y%m%d', # 20020424 ** ISO standard ** '%d%m%Y', # 24042002 ] baby_std = DateStandardiser(name = 'Baby-dob-std', description = 'Baby date of birth standardiser', input_fields = 'bdob', output_fields = ['baby_day','baby_month', 'baby_year'], parse_formats = date_parse_formats) mother_std = DateStandardiser(name = 'Mother-dob-std', description = 'Mother year of birth stand.', input_fields = 'mdob', output_fields = [None, None, 'mother_year'], parse_formats = date_parse_formats, pivot_year = 04)