Front Matter
1. Acknowledgments
2. Introduction
- 2.1 Performance
3. Data Cleaning and Record Linkage
4. System Overview
5. Configuration and Running Febrl using a Module derived from 'project.py'
6. Data Cleaning and Standardisation
- 6.1 Name and Address Cleaning and Standardisation
- 6.2 Output Fields
- 6.3 Name Cleaning and Standardisation using a Rules Based Approach
- 6.4 Name Cleaning and Standardisation using a Hidden Markov Model Based Approach
- 6.5 Address Cleaning and Standardisation using a Hidden Markov Model Based Approach
- 6.6 Date Cleaning and Standardisation
- 6.7 Field Passing
- 6.8 Record Cleaning and Standardisation
- 6.9 Starting a Standardisation Process
7. Hidden Markov Models for Data Standardisation
- 7.1 Hidden Markov Model Implementation Module 'simplehmm.py'
8. Hidden Markov Model Training
- 8.1 Program 'tagdata.py'
- 8.2 Program 'trainhmm.py'
9. Record Linkage and Deduplication
- 9.1 Indexing
- 9.2 Field Comparison Functions
- 9.3 Record Comparator
- 9.4 Example Field and Record Comparator Initialisation
- 9.5 Classification
  - 9.5.1 Fellegi and Sunter Classifier
  - 9.5.2 Flexible Classifier
- 9.6 Starting a Linkage or Deduplication Process
10. Output
- 10.1 Record Pair One-To-One Assignment Restrictions
11. Auxiliary Programs
- 11.1 Program 'randomselect.py'
- 11.2 Database Generator Program 'generate.py'
12. Data Set Access
- 12.1 COL Data Set Implementation
- 12.2 CSV Data Set Implementation
- 12.3 SQL Data Set Implementation
- 12.4 Shelve Data Set Implementation
- 12.5 Memory Data Set Implementation
13. Look-up and Frequency Tables
- 13.1 Correction List
- 13.2 Tagging Look-up Table
- 13.3 Frequency Look-up Table
- 13.4 Geographic Location Look-up Table
14. Logging and Verbose Output
15. Installation
16. Parallelism
A. Hidden Markov Model States
B. List of Tags
C. Rule-based Name Segmentation
- C.1 Input
- C.2 Process Overview
  - C.2.1 Step 1: Allocating the elements into one of the five sub-lists
  - C.2.2 Step 2: Parse each sub-list and assign into appropriate output name component
- C.3 Output
D. Manifest
E. To-Do: Outstanding Development Tasks, Possible Additions and Enhancements
- E.1 Data Cleaning and Standardisation
- E.2 Record Linkage and Deduplication
- E.3 Febrl System
F. Version History
G. Support Arrangements
H. ANU - Open Source License
Bibliography
Index
About this document ...

Contents