This program can be used to analyse a data file (currently a text file in comma separated values (CSV) only) in order to get information about its content and quality. Frequency information is collected for all columns, and various statistics are then printed and stored in a statistics results file.
The fileanalysis.py program can be started from the command line with the following argument list
python fileanalysis.pyinput_file steps num_header_lines num_columns stats_file
The needed arguments are
Three more settings can be configured within the fileanalysis.py program itself.
MISS_PERC_THRESare assumed not to be suitable for blocking and thus no estimated number of record pairs will be calculated for such columns. Default setting is .
MAX_VALUES, only the most and least frequent values will be printed. Default value for
For each column in the input file, the following information and statistics are calculated and printed.
MAX_VALUESdifferent values, all these values will be printed together with their frequency. Otherwise, only the most and least frequent values will be printed (with their frequencies).
Finally three summary tables will be printed (and stored in the results file if one is given in the arguments list). The first summary table contains the column names (if a header line is given), the number of different values in the columns, the number of records (or lines) in the columns with a missing (or empty) value, the average frequency of the values and their standard deviation, and the type (digits only, numbers only, digit and numbers, or various characters).
The second summary table presents the column names (if a header line
is given) and their quantiles as defined in the
The third summary table presents the column names and calculations on
the suitability of the columns for blocking. If a column contains less
MISS_PERC_THRES (in percentage) records with missing (or
empty) values then an expected number of record pairs is calculated by
summing over the frequencies of all values in a column:
'ident_num'contains values that occur only once (i.e. with maximum frequency 1), therefore the calculated number of record pair comparisons is zero.
Column names Suitability ---------------------------------------------------------------------------------- year 11 diff. values, resulting in 84272215518 record pair comparisons ident_num 962776 diff. values, resulting in 0 record pair comparisons surname 101301 diff. values, resulting in 1608892180 record pair comparisons Note column contains 2.23% (21483) records with missing values wayfarenum 16.46% (158478) records with missing values (not suitable)