This program can be used to analyse a data file (currently a text file in comma separated values (CSV) only) in order to get information about its content and quality. Frequency information is collected for all columns, and various statistics are then printed and stored in a statistics results file.
The fileanalysis.py program can be started from the command line with the following argument list
python fileanalysis.py
input_file steps
num_header_lines num_columns stats_file
The needed arguments are
Three more settings can be configured within the fileanalysis.py program itself.
MISS_PERC_THRES
MISS_PERC_THRES
are assumed not to be suitable for
blocking and thus no estimated number of record pairs will be
calculated for such columns. Default setting is .
QUANT_LIST
MAX_VALUES
MAX_VALUES
, only the most and least frequent values will
be printed. Default value for MAX_VALUES
is 13.
For each column in the input file, the following information and statistics are calculated and printed.
QUANT_LIST
.
MAX_VALUES
different values,
all these values will be printed together with their frequency.
Otherwise, only the most and least frequent values will be
printed (with their frequencies).
Finally three summary tables will be printed (and stored in the results file if one is given in the arguments list). The first summary table contains the column names (if a header line is given), the number of different values in the columns, the number of records (or lines) in the columns with a missing (or empty) value, the average frequency of the values and their standard deviation, and the type (digits only, numbers only, digit and numbers, or various characters).
The second summary table presents the column names (if a header line
is given) and their quantiles as defined in the QUANT_LIST
.
The third summary table presents the column names and calculations on
the suitability of the columns for blocking. If a column contains less
than MISS_PERC_THRES
(in percentage) records with missing (or
empty) values then an expected number of record pairs is calculated by
summing over the frequencies of all values in a column:
'ident_num'
contains values that occur only once (i.e. with
maximum frequency 1), therefore the calculated number of record pair
comparisons is zero.
Column names Suitability ---------------------------------------------------------------------------------- year 11 diff. values, resulting in 84272215518 record pair comparisons ident_num 962776 diff. values, resulting in 0 record pair comparisons surname 101301 diff. values, resulting in 1608892180 record pair comparisons Note column contains 2.23% (21483) records with missing values wayfarenum 16.46% (158478) records with missing values (not suitable)