12.3 File Analysis Program 'fileanalysis.py'

This program can be used to analyse a data file in order to get information about its content and quality. Currently only text files with comma separated values (CSV) are supported. Frequency information is collected for all columns, and various statistics are then printed and stored in a statistics results file.

The fileanalysis.py program can be started from the command line with the following argument list:

python fileanalysis.py input_file steps num_header_lines num_columns stats_file

The required arguments are:

input_file
Name of the input file to be analysed. This has to be a text file with comma separated values (CSV).
steps
Either a percentage value (e.g. 10%) or a number of lines (larger than zero). This gives the percentage of lines in the file, or the number of lines, after which intermediate statistics will be printed out. This can be helpful for very large files which might take too long to process completely.
num_header_lines
The number of header lines at the beginning of the file which will be skipped when collecting the statistics.
num_columns
The expected number of columns in the file. A warning will be printed for each line in the file which has a different number of columns.
stats_file
The name of a text file into which the final statistics results will be stored.
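
As an illustration of the two forms the steps argument can take, the following is a minimal sketch of how such a value could be converted into a number of lines. The function name parse_steps and its total_lines parameter are hypothetical and are not taken from fileanalysis.py; the actual program may handle this argument differently.

```python
def parse_steps(steps_arg, total_lines):
    """Return the number of lines after which intermediate statistics
    would be printed (hypothetical helper, not part of fileanalysis.py)."""
    steps_arg = steps_arg.strip()

    if steps_arg.endswith('%'):
        # Percentage form, e.g. '10%'
        perc = float(steps_arg[:-1])
        if not 0.0 < perc <= 100.0:
            raise ValueError('Percentage out of range: %s' % steps_arg)
        # Assumption: always report at least every line for tiny files
        return max(1, int(total_lines * perc / 100.0))

    # Absolute form: a number of lines larger than zero
    num_lines = int(steps_arg)
    if num_lines <= 0:
        raise ValueError('Number of lines must be larger than zero')
    return num_lines
```

With this sketch, a value of 10% for a file with 1000 lines and a value of 100 would both lead to intermediate statistics every 100 lines.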

Three more settings can be configured within the fileanalysis.py program itself.

MISS_PERC_THRES
A numerical threshold (as a percentage between 0 and 100). Columns with a percentage of missing values larger than MISS_PERC_THRES are assumed not to be suitable for blocking, and thus no estimated number of record pairs will be calculated for such columns. The default setting is 5%.
QUANT_LIST
A list containing one or more quantiles, given as values between 0 and 1 (inclusive). Quartiles, for example, could be configured by setting QUANT_LIST = [0.0,0.25,0.5,0.75,1.0]. The default quantiles are: QUANT_LIST = [0.0,0.05,0.25,0.5,0.75,0.95,1.0]
MAX_VALUES
The maximum number of values that will be printed for a column (see below for more details). If a column has more than MAX_VALUES different values, only the most and least frequent values will be printed. The default value for MAX_VALUES is 13.
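
To make the meaning of the entries in QUANT_LIST more concrete, here is a minimal sketch that selects quantiles from a list of values. It assumes a simple rule (sort the values and take the element at position q times the last index, rounded down); the function name quantiles is hypothetical, and fileanalysis.py may use a different quantile definition.

```python
QUANT_LIST = [0.0, 0.25, 0.5, 0.75, 1.0]  # Quartiles, as in the example above


def quantiles(values, quant_list):
    """Return one value per entry in quant_list (hypothetical helper).

    Assumption: the quantile q is the element at index int(q * (n - 1))
    of the sorted values, with no interpolation.
    """
    if not values:
        raise ValueError('Cannot calculate quantiles of an empty list')
    for q in quant_list:
        if not 0.0 <= q <= 1.0:
            raise ValueError('Quantile not between 0 and 1: %s' % q)

    sorted_vals = sorted(values)
    last = len(sorted_vals) - 1
    return [sorted_vals[int(q * last)] for q in quant_list]
```

Under this rule the quantile 0.0 is always the smallest value and 1.0 the largest, so a list such as [0.0,0.5,1.0] gives the minimum, a median-like value, and the maximum.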

For each column in the input file, the following information and statistics are calculated and printed.

The column number (starting with 0) and its name(s) taken from the header line(s) (if available in the file).
The number of different values.
The smallest and largest value (according to string sorting).
The average frequency for the values in a column, and their standard deviation.
The quantiles as defined in the QUANT_LIST.
If a column has fewer than MAX_VALUES different values, all these values will be printed together with their frequencies. Otherwise, only the most and least frequent values will be printed (with their frequencies).
The minimum and maximum value lengths.
Three flags: (1) All column values are digits only; (2) all values are alphabetics only; and (3) all values are alphanumeric only.
The maximum number of spaces in a value.
The number of records with a missing value (an empty string or whitespace only).
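
The per-column statistics listed above can be sketched roughly as follows. This is an illustration only: the function name column_stats and the dictionary keys are hypothetical, and it assumes that missing values are excluded from the frequency counts and that the standard deviation is the population standard deviation, neither of which is confirmed for fileanalysis.py.

```python
import math
from collections import Counter


def column_stats(values):
    """Collect basic statistics for one column (hypothetical helper)."""
    # A value counts as missing if it is empty or whitespace only
    present = [v for v in values if v.strip() != '']
    freq = Counter(present)  # Value -> number of occurrences
    counts = list(freq.values())
    num_diff = len(counts)

    avg_freq = sum(counts) / float(num_diff) if num_diff else 0.0
    std_freq = (math.sqrt(sum((c - avg_freq) ** 2 for c in counts) / num_diff)
                if num_diff else 0.0)

    return {
        'num_diff': num_diff,
        'min_val': min(freq) if freq else None,  # String sorting
        'max_val': max(freq) if freq else None,
        'avg_freq': avg_freq,
        'std_freq': std_freq,
        'min_len': min(len(v) for v in freq) if freq else 0,
        'max_len': max(len(v) for v in freq) if freq else 0,
        'digits_only': bool(freq) and all(v.isdigit() for v in freq),
        'alpha_only': bool(freq) and all(v.isalpha() for v in freq),
        'alnum_only': bool(freq) and all(v.isalnum() for v in freq),
        'max_spaces': max(v.count(' ') for v in freq) if freq else 0,
        'num_missing': len(values) - len(present),
    }
```

Note that with Python's string methods a column of digits-only values is also reported as alphanumeric only, since isalnum() accepts both digits and letters.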

Finally, three summary tables will be printed (and stored in the results file if one is given in the argument list). The first summary table contains the column names (if a header line is given), the number of different values in the columns, the number of records (or lines) in the columns with a missing (or empty) value, the average frequency of the values and their standard deviation, and the type (digits only, letters only, digits and letters, or various characters).

The second summary table presents the column names (if a header line is given) and their quantiles as defined in the QUANT_LIST.

The third summary table presents the column names and calculations on the suitability of the columns for blocking. If the percentage of records with missing (or empty) values in a column is below MISS_PERC_THRES, then an expected number of record pairs is calculated by summing over the frequencies of all values in the column:

$$ num\_blocks = \sum_{i} freq[i] \times (freq[i] - 1) $$

A small example summary table is shown below. Note that the column 'ident_num' contains values that occur only once (i.e. with maximum frequency 1), therefore the calculated number of record pair comparisons is zero.

Column names  Suitability
----------------------------------------------------------------------------------
        year  11 diff. values, resulting in 84272215518 record pair comparisons
   ident_num  962776 diff. values, resulting in 0 record pair comparisons
     surname  101301 diff. values, resulting in 1608892180 record pair comparisons
              Note column contains 2.23% (21483) records with missing values
  wayfarenum  16.46% (158478) records with missing values  (not suitable)
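
The suitability calculation described above can be sketched as follows. The function name record_pair_estimate and its parameters are hypothetical; the sketch follows the description of MISS_PERC_THRES (columns with a percentage of missing values larger than the threshold are skipped) and the summation formula as given, and is not the code used in fileanalysis.py.

```python
MISS_PERC_THRES = 5.0  # Default threshold in percent


def record_pair_estimate(freq, num_missing, num_records):
    """Return the expected number of record pairs for one column, or
    None if the column has too many missing values (hypothetical helper).

    freq         Dictionary mapping each value to its frequency.
    num_missing  Number of records with a missing value in this column.
    num_records  Total number of records in the file.
    """
    miss_perc = 100.0 * num_missing / num_records
    if miss_perc > MISS_PERC_THRES:
        return None  # Column assumed not suitable for blocking

    # Sum of freq[i] * (freq[i] - 1) over all values in the column
    return sum(f * (f - 1) for f in freq.values())
```

A column in which every value occurs only once, such as 'ident_num' in the table above, yields zero because each term freq[i] - 1 is zero.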