The following files are provided with the current distribution of Febrl (Version 0.2.2).
ANUOS_v1.1.txt LICENSE.txt README.txt address.py classification.py classificationTest.py comparison.py comparisonTest.py dataset.py datasetTest.py date.py dateTest.py encode.py encodeTest.py febrl.py indexing.py lap.py lapTest.py lookup.py lookupTest.py mymath.py mymathTest.py name.py output.py parallel.py project-deduplicate.py project-linkage.py project-standardise.py randomselect.py simplehmm.py simplehmmTest.py standardisation.py stringcmp.py stringcmpTest.py tagdata.py tcsv.py trainhmm.py
The hmm/ directory contains some example hidden Markov model
training data sets ('.csv' files) and some example HMMs derived
from them. The training data has been derived from files of NSW
death certificates and MDC (Midwives Data Collection) data, so
the example HMMs should work adequately with most Australian
name data and NSW address data. The tagging look-up tables in
the data/ directory will need to be modified to suit other
states of Australia or other countries. In future versions we
plan to include look-up tables and example training sets that
are suitable for initial use anywhere in Australia. We are also
happy to include example files for other countries if these are
contributed.
address-absdiscount.hmm address-laplace.hmm address-sample-training-data.csv address.hmm hmm-states.txt name-absdiscount.hmm name-laplace.hmm name-sample-training-data.csv name.hmm
The data/ directory contains look-up tables, correction lists
and frequency tables.
address_corr.lst address_misc.tbl address_qual.tbl country.tbl givenname_f.tbl givenname_f_freq.csv givenname_m.tbl givenname_m_freq.csv institution_type.tbl name_corr.lst name_misc.tbl name_prefix.tbl post_address.tbl postcode_centroids.csv saints.tbl surname.tbl territory.tbl title.tbl unit_type.tbl wayfare_type.tbl
The dbgen/ directory contains the database generator generate.py
and all its associated files. Frequency files are stored in the
sub-directory dbgen/data/.
README.txt dataset1.csv dataset2.csv dataset3.csv dataset4a.csv dataset4b.csv generate.py data/address1.csv data/address2.csv data/givenname.csv data/postcode.csv data/state.csv data/streetnumber.csv data/suburb.csv data/surname.csv
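To check that a download has unpacked into the layout described above, a short script along the following lines may help. This is a minimal sketch and not part of Febrl: the function name `missing_files` and the `EXPECTED` list are hypothetical, introduced here for illustration only. The paths checked are a small, non-exhaustive sample taken from the listings above.

```python
import os

# A few representative paths taken from the listings above (not exhaustive).
# This list is an illustrative assumption, not something Febrl ships.
EXPECTED = [
    "febrl.py",
    "project-standardise.py",
    "hmm/name.hmm",
    "hmm/address.hmm",
    "data/surname.tbl",
    "data/postcode_centroids.csv",
    "dbgen/generate.py",
    "dbgen/data/surname.csv",
]


def missing_files(root, expected=EXPECTED):
    """Return the expected relative paths that are not files under root."""
    missing = []
    for rel_path in expected:
        # The listings use '/' separators; build a path for the local platform.
        full_path = os.path.join(root, *rel_path.split("/"))
        if not os.path.isfile(full_path):
            missing.append(rel_path)
    return missing
```

Calling `missing_files(".")` from the top-level directory of the unpacked distribution returns an empty list when every sampled file is present, and otherwise returns the relative paths that could not be found.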