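Several output forms are possible with the current Febrl version, with more to be included in the future. Currently, the three possible output forms are a histogram of the record pair weights, a detailed listing of the record pairs, and a listing of the record pair weights.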
It is possible to select any combination of these three output forms. Additionally, a one-to-one assignment restriction procedure can be activated before the output is generated (but after the record pairs have been classified). This restricts each record in a data set to be linked to at most one record in the other data set (or, for a deduplication, a record can be a duplicate of only one other record). See Section 10.1 below for more details on this topic.
Setting the desired output forms, as well as activating an assignment restriction procedure, needs to be done with the appropriate arguments when the linkage or deduplication process is defined (see Section 9.6). The following output-related arguments need to be defined within a deduplication or link method.
output_histogram
Can be set to True (in which case a histogram is printed, or displayed, in the terminal window), to a file name (a string, in which case the histogram is written into this file), or to False (in which case no histogram is printed or saved).
Such a histogram is made of simple characters, with the weights
(starting with low values) being on the vertical axis. For each
weight, a bar is displayed horizontally indicating the number of
record pairs with this weight, as shown in the following example
histogram.
    Weight histogram:
    -----------------
      0 ***** 47
      1 ******************** 199
      2 ************ 122
      3 **** 34
      4 ***** 48
      5 ******** 81
      6 ***************** 178
      7 ************************* 256
      8 **************** 165
      9 **** 41
     10 ** 38
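A character histogram of this kind could be produced along the following lines. This is only a minimal sketch and not Febrl's own code: the function name `weight_histogram`, the rounding of weights to integer bins, and the scaling of one `*` per ten record pairs are all assumptions made for illustration.

```python
from collections import Counter


def weight_histogram(weights, pairs_per_star=10):
    """Return the lines of a simple character histogram (hypothetical helper).

    Weights are rounded to the nearest integer bin. Each bar holds one '*'
    per `pairs_per_star` record pairs, rounded up so that every non-empty
    bin shows at least one star. Both choices are assumptions.
    """
    lines = ["Weight histogram:", "-----------------"]
    counts = Counter(int(round(w)) for w in weights)
    if not counts:
        return lines

    # Low weights first, one line per integer weight
    for weight in range(min(counts), max(counts) + 1):
        count = counts.get(weight, 0)
        stars = "*" * -(-count // pairs_per_star)  # ceiling division
        lines.append("%4d %s %d" % (weight, stars, count))
    return lines


if __name__ == "__main__":
    example = [0.2] * 47 + [1.1] * 199 + [2.0] * 122
    print("\n".join(weight_histogram(example)))
```

Returning a list of lines rather than printing directly makes it easy to either display the histogram in the terminal or write it to a file, mirroring the True and file name settings of the argument.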
output_rec_pair_details
Can be set to True, to a file name (a string), or to False. If set to True, all record pairs with a weight larger than output_threshold (if defined, see below) are printed (displayed) in the terminal window. If set to a file name, the record pairs are written into this text file. Each record pair is printed or saved in a three-column format, with the field names in the first column, the values of the first record in the middle column, and the values of the second record in the third column. The first line for a record pair is the total weight and the second line contains the two record identifiers (made of the data set names and the record numbers), followed by a number of lines containing the record fields. If a record pair was assigned in the assignment procedure, it is displayed or saved accordingly (with the string '[assigned]'), as shown in the example output below (note that names and addresses were selected randomly and are not part of a real-world data set).
    Resulting record pairs:
    -----------------------
      Output threshold: 10.000000

      Data set A: example4a-tmp
      Data set B: example4b-tmp
    -----------------------------------------------------------------------------
    Weight: 64.378553  [assigned]

    Fields       | [RecID A: 4973/example4a-tmp] | [RecID B: 330/example4b-tmp]
    address_hmm_ | 1.71249174813e-09             | 1.61564691479e-11
    dob_day      | 28                            | 28
    dob_month    | 12                            | 12
    dob_year     | 1939                          | 1939
    gender_guess | female                        | female
    given_name   | bridget                       | bridget
    locality_nam | parkinson                     | parkinson
    postcode     | 2705                          | 2705
    rec_id       | rec-3023-org                  | rec-3023-dup-0
    soc_sec_id   | 3815665                       | 3815665
    surname      | hand                          | hand
    territory    | new_south_wales               | new_south_wales
    wayfare_name | bunton                        | bunton
    wayfare_numb | 210                           | 210
    wayfare_type | place                         | place
    -----------------------------------------------------------------------------
    Weight: 40.274501

    Fields       | [RecID A: 3228/example4a-tmp] | [RecID B: 763/example4b-tmp]
    address_hmm_ | 6.42884838319e-07             | 6.42884838319e-07
    dob_day      | 11                            | 11
    dob_month    | 11                            | 11
    dob_year     | 1952                          | 1952
    gender_guess | male                          |
    given_name   | edward                        | edwafd
    locality_nam | lykabetos cabramatta          | bobblegigbie cabramatta
    postcode     | 3201                          | 3201
    rec_id       | rec-3179-org                  | rec-3179-dup-0
    soc_sec_id   | 5673646                       | 5673646
    surname      | marshman                      | marshman
    territory    | western_australia             | western_australia
    wayfare_name | cambridge                     | cambridge
    wayfare_numb | 4                             | 852
    wayfare_type | street                        | street
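The three-column layout described above can be sketched as follows. This is an illustration only and not taken from the Febrl sources: the helper `format_rec_pair`, its parameters, and the representation of records as dictionaries are assumptions. Field names are cut to 12 characters to match what the example output appears to do.

```python
def format_rec_pair(weight, rec_id_a, rec_id_b, rec_a, rec_b, assigned=False):
    """Format one record pair in three columns (hypothetical helper).

    rec_a and rec_b are dictionaries mapping field names to values.
    Returns a list of text lines: weight line, header line, field lines.
    """
    header = "Weight: %f" % weight
    if assigned:
        header += "  [assigned]"

    col_a = "[RecID A: %s]" % rec_id_a
    col_b = "[RecID B: %s]" % rec_id_b
    fields = sorted(set(rec_a) | set(rec_b))

    # The middle column is as wide as its heading or its longest value
    width_a = max([len(col_a)] + [len(str(rec_a.get(f, ""))) for f in fields])

    lines = [header, "%-12s | %-*s | %s" % ("Fields", width_a, col_a, col_b)]
    for f in fields:
        value_a = str(rec_a.get(f, ""))  # a missing field is shown as empty
        value_b = str(rec_b.get(f, ""))
        lines.append("%-12s | %-*s | %s" % (f[:12], width_a, value_a, value_b))
    return lines


if __name__ == "__main__":
    a = {"given_name": "edward", "wayfare_number": "4"}
    b = {"given_name": "edwafd", "wayfare_number": "852"}
    print("\n".join(format_rec_pair(40.274501, "3228/example4a-tmp",
                                    "763/example4b-tmp", a, b)))
```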
output_rec_pair_weights
Can be set to True, to a file name (a string), or to False. If set to True or to a file name, all record pairs with a weight larger than output_threshold (if defined) are displayed (or written into the file). Only the record numbers and the corresponding total weight are displayed or saved. Additionally, if a record pair has been assigned in the assignment procedure, the string '[assigned]' is displayed or saved as well. As can be seen in the example below, the first column contains the record identification numbers from the first data set and the second column contains the record identification numbers of the matching records in the second data set.
    Resulting record pairs:
    -----------------------
      Output threshold: 10.000000

      Data set A: example4a-tmp
      Data set B: example4b-tmp

    0     1449:  19.9035825896  [assigned]
    1     2750:  40.2745006027  [assigned]
    2     4656:  45.0119233965  [assigned]
           777:  11.5367909784
    3     4119:  59.6411298934  [assigned]
          3306:  49.7493461902
    5     2305:  24.3405844039
          3944:  23.4012517298  [assigned]
    6     2801:  54.486768984  [assigned]
          3289:  11.2032464488
          2743:  10.8222053532
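A compact listing like the one above could be generated from a nested dictionary of weights, as in the following sketch. The data structure (`{record_a: {record_b: weight}}`), the function name `format_rec_pair_weights`, the column widths, and the ordering by decreasing weight are assumptions for illustration and are not taken from the Febrl sources.

```python
def format_rec_pair_weights(results, assigned=frozenset()):
    """List record numbers and weights, one line per pair (hypothetical helper).

    results:  {rec_num_a: {rec_num_b: weight}}
    assigned: set of (rec_num_a, rec_num_b) pairs chosen by the
              assignment procedure
    """
    lines = []
    for rec_a in sorted(results):
        # Highest weight first for each record of the first data set
        matches = sorted(results[rec_a].items(), key=lambda item: -item[1])
        for i, (rec_b, weight) in enumerate(matches):
            first_col = str(rec_a) if i == 0 else ""  # show rec_a only once
            line = "%-6s %6s: %f" % (first_col, rec_b, weight)
            if (rec_a, rec_b) in assigned:
                line += " [assigned]"
            lines.append(line)
    return lines


if __name__ == "__main__":
    weights = {2: {4656: 45.0119, 777: 11.5368}, 0: {1449: 19.9036}}
    print("\n".join(format_rec_pair_weights(weights, {(0, 1449), (2, 4656)})))
```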
output_threshold
If set to None, all record pairs stored in the results data structure (as calculated by the classifier) will be printed or saved. Note that only the record pairs with matching weights larger than the lower threshold (i.e. matches and possible matches) are stored in this results data structure. However, this can still be a huge number of record pairs, many having a low weight. By setting this output threshold to a positive number (e.g. equal to the upper threshold as defined in a Fellegi and Sunter classifier) one can restrict the volume of the output. With the exception of the histogram, all other output forms (as well as assignment procedures) will be restricted to record pairs with weights that are equal to or larger than the output threshold.
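The effect of the output threshold can be sketched as a simple filter. The function `select_for_output` and the representation of record pairs as `(rec_a, rec_b, weight)` tuples are hypothetical. The comparison follows the "equal to or larger than" wording of the paragraph above.

```python
def select_for_output(rec_pairs, output_threshold=None):
    """Return the record pairs passed on to the record pair outputs.

    rec_pairs is a list of (rec_a, rec_b, weight) tuples (an assumed
    representation). With a threshold of None nothing is filtered out.
    """
    if output_threshold is None:
        return list(rec_pairs)
    return [pair for pair in rec_pairs if pair[2] >= output_threshold]


if __name__ == "__main__":
    pairs = [(0, 1449, 19.9), (2, 777, 11.5), (6, 99, 4.2)]

    # The histogram is built from all stored pairs ...
    histogram_input = pairs
    # ... while the other output forms only see pairs at or above the threshold
    listed = select_for_output(pairs, output_threshold=10.0)
    print(len(histogram_input), len(listed))
```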
output_assignment
Can be set to 'one2one' to activate the one-to-one assignment restriction. See Section 10.1 for more information on assignment restrictions.
An example of how to use these arguments when defining a linkage or deduplication process can be seen in Chapter 5 and Section 9.6.
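As a rough illustration of the arguments described in this section, the sketch below collects them in a dictionary and checks their values against the descriptions given above. The helper `check_output_args` and the file name are made up for this example. Treating a missing output form as False, and accepting None for `output_assignment` to mean "no restriction", are assumptions. The remaining arguments of the deduplication or link method (see Section 9.6) are not shown.

```python
OUTPUT_FORMS = ("output_histogram",
                "output_rec_pair_details",
                "output_rec_pair_weights")


def check_output_args(args):
    """Check output arguments against this section (hypothetical helper)."""
    for name in OUTPUT_FORMS:
        value = args.get(name, False)  # assumed default: output switched off
        if not isinstance(value, (bool, str)):
            raise ValueError("%s must be True, False or a file name" % name)

    threshold = args.get("output_threshold")
    is_number = (isinstance(threshold, (int, float))
                 and not isinstance(threshold, bool))
    if threshold is not None and not is_number:
        raise ValueError("output_threshold must be None or a number")

    # Assumption: None means that no assignment restriction is applied
    if args.get("output_assignment") not in (None, "one2one"):
        raise ValueError("output_assignment must be None or 'one2one'")
    return args


output_args = check_output_args({
    "output_histogram": True,                          # print to the terminal
    "output_rec_pair_details": "example-details.txt",  # made-up file name
    "output_rec_pair_weights": False,                  # no weights listing
    "output_threshold": 10.0,
    "output_assignment": "one2one",
})

# The checked dictionary could then be passed as keyword arguments to the
# deduplication or link method, together with its other arguments.
```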