The process-gnaf.py program works by building in-memory hash
table data structures for all the fields (or attributes) in the G-NAF
data files, and it therefore needs a large amount of main memory and
takes considerable processing time. For example, processing the New South
Wales part of G-NAF (containing around 4 million address site records)
on a Sun Enterprise 450 shared-memory (SMP) server with four
480 MHz UltraSPARC-II processors and 4 GB of main
memory used around 3.3 GB of main memory
and took around 34.5 hours (with all processing flags in
process-gnaf.py set to True
). There are two ways to reduce the
amount of memory needed. The first is to set save_pickle_files
to True and save_shelve_files to False, start the
pre-processing, and, once it has finished, swap the flags
(i.e. set save_pickle_files to False and save_shelve_files to True)
and restart the pre-processing. The second is to set only one of
process_locality_files, process_street_files, process_address_files,
create_reverse_lookup_shelve, and create_gnaf_address_csv_file
to True and all others to False, and repeat this process until all
pre-processing is done.
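The memory trade-off behind the first approach comes from how Python's pickle and shelve modules store a lookup table. The sketch below (not taken from process-gnaf.py; the file names and data are illustrative) shows the difference: a pickle file serialises one fully built in-memory dictionary, while a shelve file is a disk-backed dictionary whose entries can be written and read key by key, so the whole table need not be resident in memory at once.

```python
import os
import pickle
import shelve
import tempfile

# A small stand-in for one G-NAF attribute lookup table.
lookup = {"PID%d" % i: {"street": "Street %d" % i} for i in range(1000)}

work_dir = tempfile.mkdtemp()

# Pickle: the entire dictionary must exist in memory, then is
# written to disk in a single serialisation step.
pickle_path = os.path.join(work_dir, "lookup.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(lookup, f)

# Shelve: a persistent, disk-backed dictionary.  Entries are
# stored individually, so large tables can be built and queried
# without holding every record in memory at the same time.
shelve_path = os.path.join(work_dir, "lookup_shelve")
db = shelve.open(shelve_path)
for key, value in lookup.items():
    db[key] = value
db.close()

# Later, individual records are fetched from disk on demand.
db = shelve.open(shelve_path)
street42 = db["PID42"]["street"]
db.close()
```

Writing the pickle files first and converting to shelve files in a second pass, as described above, avoids paying both the in-memory and the on-disk-dictionary cost in a single run.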