D. Manifest

The following files are provided with the current distribution of Febrl (Version 0.3).

The main directory contains the Python programs, license files and documentation. PDF and compressed (gzipped) PostScript versions of the manual can be downloaded from the Febrl web site; they are not included in the standard distribution because of their size.

      ANUOS-1.2.txt
      LICENSE.txt
      README.txt
      address.py
      classification.py
      classificationTest.py
      comparison.py
      comparisonTest.py
      dataset.py
      datasetTest.py
      date.py
      dateTest.py
      encode.py
      encodeTest.py
      febrl.py
      geocoding.py
      indexing.py
      lap.py
      lapTest.py
      lookup.py
      lookupTest.py
      mymath.py
      mymathTest.py
      name.py
      output.py
      parallel.py
      phonenum.py
      phonenumTest.py
      project-deduplicate.py
      project-geocode.py
      project-linkage.py
      project-standardise.py
      qgramindex.py
      qgramindexTest.py
      randomselect.py
      simplehmm.py
      simplehmmTest.py
      standardisation.py
      stringcmp.py
      stringcmpTest.py
      tagdata.py
      trainhmm.py

The data directory contains look-up tables, correction lists and frequency tables.

      address_corr.lst
      address_misc.tbl
      address_qual.tbl
      country.tbl
      givenname_f.tbl
      givenname_f_freq.csv
      givenname_m.tbl
      givenname_m_freq.csv
      institution_type.tbl
      locality_name_act.tbl
      locality_name_nsw.tbl
      name_corr.lst
      name_misc.tbl
      name_prefix.tbl
      post_address.tbl
      postcode_act.tbl
      postcode_act_freq.csv
      postcode_centroids.csv
      postcode_nsw.tbl
      postcode_nsw_freq.csv
      saints.tbl
      suburb_act_freq.csv
      suburb_nsw_freq.csv
      surname.tbl
      surname_act_freq.csv
      surname_nsw_freq.csv
      territory.tbl
      title.tbl
      unit_type.tbl
      wayfare_type.tbl

The dsgen directory contains the data set generator generate.py and all its associated files. Frequency files are stored in the sub-directory dsgen/data.

      README.txt
      dataset1.csv
      dataset2.csv
      dataset3.csv
      dataset4a.csv
      dataset4b.csv
      generate.py

      data/address1-freq.csv
      data/address2-freq.csv
      data/age-freq.csv
      data/givenname-freq.csv
      data/givenname-misspell.tbl
      data/postcode-freq.csv
      data/state-freq.csv
      data/streetnumber-freq.csv
      data/suburb-freq.csv
      data/suburb-misspell.tbl
      data/surname-freq.csv
      data/surname-misspell.tbl

The geocode directory contains programs and files needed for the Febrl geocoding system.

      get-neighbour-regions.py
      gnaffunctions.py
      pc-neighbours-1.txt
      pc-neighbours-2.txt
      process-gnaf.py
      reverse-gnaf.py
      suburb-neighbours-1.txt
      suburb-neighbours-2.txt
      testaddresses-small.txt

The hmm directory contains example hidden Markov model (HMM) training data sets ('.csv' files) and example HMMs ('.hmm' files) derived from them. The training data was derived from files of NSW death certificates and MDC (Midwives Data Collection) data. These files should work adequately with most Australian name data and NSW address data. The tagging look-up tables in the data directory will need to be modified to suit other states of Australia or other countries. In future versions we plan to include look-up tables and example training sets that are suitable for initial use anywhere in Australia. We are also happy to include example files for other countries if these are contributed.

      address-absdiscount.hmm
      address-laplace.hmm
      address-sample-training-data.csv
      address.hmm
      geocode-nsw-address.hmm
      hmm-states.txt
      name-absdiscount.hmm
      name-laplace.hmm
      name-sample-training-data.csv
      name.hmm
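
The file lists above can be used to check that an unpacked distribution is complete. The following is a minimal sketch only and is not part of Febrl: the function name missing_files and the EXPECTED dictionary are hypothetical, only a small subset of the manifest is listed, and it assumes that the data, dsgen, geocode and hmm directories sit directly below the main directory. It is written for a current Python 3 interpreter rather than the Python version Febrl 0.3 was developed for.

```python
from pathlib import Path

# Hypothetical subset of the manifest: directory -> files expected in it.
# "." stands for the main directory; the layout of the sub-directories
# below it is an assumption. Extend the lists as needed.
EXPECTED = {
    ".": ["LICENSE.txt", "README.txt", "febrl.py"],
    "data": ["address_corr.lst", "name_corr.lst", "title.tbl"],
    "dsgen": ["README.txt", "generate.py", "data/surname-freq.csv"],
    "geocode": ["process-gnaf.py", "gnaffunctions.py"],
    "hmm": ["address.hmm", "name.hmm", "hmm-states.txt"],
}


def missing_files(root, expected=EXPECTED):
    """Return the sorted relative paths of expected files not found under root."""
    root = Path(root)
    missing = []
    for directory, names in expected.items():
        for name in names:
            relative = Path(directory) / name
            if not (root / relative).is_file():
                missing.append(relative.as_posix())
    return sorted(missing)


if __name__ == "__main__":
    # Check the current working directory and report anything that is absent.
    for path in missing_files("."):
        print("missing:", path)
```

Run from the main directory, the script prints one line for each listed file it cannot find and prints nothing if all listed files are present.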