Field comparators must be initialised (constructed) before they can
be used. The following examples assume that a data set mydata_1 has
the fields givenname, surname, age and postcode, and that a second
data set mydata_2 has the fields gname, sname, dob, mob, yob (day,
month and year of birth) and pcode. Additionally, a frequency table
surname_freq and a geocode table postcode_geocode (used by the
distance comparator) are assumed to be available and loaded. The
examples show how to set up different field comparators, which a
record comparator (initialised at the end of the example code) then
uses to compute weight vectors.
# ====================================================================
# Exact string comparison of surnames, using the frequency table
surname_exact = FieldComparatorExactString(
    fields_a='surname',
    fields_b='sname',
    m_prob=0.95, u_prob=0.001,
    missing_weight=0.0,
    frequency_table=surname_freq,
    freq_table_max_weight=20.0,
    freq_table_min_weight=-20.0)
# Approximate string comparison of surnames using the 'jaro' method
surname_jaro = FieldComparatorApproxString(
    fields_a='surname',
    fields_b='sname',
    m_prob=0.95, u_prob=0.001,
    missing_weight=0.0,
    frequency_table=surname_freq,
    freq_table_max_weight=20.0,
    freq_table_min_weight=-20.0,
    compare_method='jaro')
# Truncated string comparison of given names (max_string_len=4)
givenname_trunc = FieldComparatorTruncateString(
    fields_a='givenname',
    fields_b='gname',
    m_prob=0.90, u_prob=0.02,
    missing_weight=0.0,
    max_string_len=4)
# Key difference comparison of postcodes (max_key_diff=1)
postcode_keydiff = FieldComparatorKeyDiff(
    fields_a='postcode',
    fields_b='pcode',
    m_prob=0.98, u_prob=0.001,
    missing_weight=0.0,
    max_key_diff=1)
# Distance comparison of postcodes via the geocode table
postcode_distance = FieldComparatorDistance(
    fields_a='postcode',
    fields_b='pcode',
    m_prob=0.98, u_prob=0.001,
    missing_weight=0.0,
    geocode_table=postcode_geocode,
    max_distance=42.0)
# Age comparison with separate day, month and year probabilities
age = FieldComparatorAge(
    fields_a='age',
    fields_b=['dob', 'mob', 'yob'],
    m_probability_day=0.9,
    u_probability_day=0.01,
    m_probability_month=0.98,
    u_probability_month=0.0001,
    m_probability_year=0.95,
    u_probability_year=0.001,
    missing_weight=0.0,
    max_perc_diff=10,
    fix_date='20000101')
field_comparisons = [surname_exact, surname_jaro, givenname_trunc,
                     postcode_keydiff, postcode_distance, age]
record_comparator = RecordComparator(mydata_1, mydata_2,
                                     field_comparisons)
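
To give a feel for what the m_prob, u_prob and missing_weight arguments mean, the following stand-alone sketch shows how an exact comparison could turn them into a weight. It assumes the usual Fellegi-Sunter weights, log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement. The function names are hypothetical and are not part of the comparator classes above, which may compute their weights differently (for example when a frequency table is supplied).

```python
import math


def agreement_weight(m_prob, u_prob):
    """Weight when two field values agree: log2(m / u)."""
    return math.log(m_prob / u_prob, 2)


def disagreement_weight(m_prob, u_prob):
    """Weight when two field values disagree: log2((1 - m) / (1 - u))."""
    return math.log((1.0 - m_prob) / (1.0 - u_prob), 2)


def exact_field_weight(value_a, value_b, m_prob, u_prob,
                       missing_weight=0.0):
    """Hypothetical exact comparison; empty strings count as missing."""
    if value_a == '' or value_b == '':
        return missing_weight
    if value_a == value_b:
        return agreement_weight(m_prob, u_prob)
    return disagreement_weight(m_prob, u_prob)


# Same probabilities as surname_exact above (m_prob=0.95, u_prob=0.001)
w_agree = exact_field_weight('miller', 'miller', 0.95, 0.001)
w_disagree = exact_field_weight('miller', 'smith', 0.95, 0.001)
w_missing = exact_field_weight('miller', '', 0.95, 0.001)
```

Under these assumptions, agreeing surnames yield a weight of roughly 9.9, disagreeing surnames roughly -4.3, and a missing value falls back to missing_weight. A large m_prob/u_prob ratio therefore makes agreement on that field strong evidence for a match.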