Febrl - Freely extensible biomedical record linkage
10.5.3 Geocoding To-Do
The following incomplete list contains outstanding development tasks
and ideas for improvements of the Febrl geocode matching
system.
- Preprocessing G-NAF from SQL databases
The preprocessing as described in
Section 10.3 is currently based on CSV
(comma separated values) text files, as this was the format in which
G-NAF was supplied in its first version. Recent and future
G-NAF releases are based on binary files and include a program
for unpacking and loading the G-NAF tables into SQL databases.
The program process-gnaf.py and module
gnaffunctions.py (which contains the main computational
routines used in process-gnaf.py) therefore have to be
rewritten to allow loading from SQL databases instead of CSV
text files.
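One possible way to approach this rewrite is to hide the data source behind a common row iterator, so that the computational routines do not need to know whether records come from CSV or SQL. The following is only a minimal sketch under stated assumptions: the function names rows_from_csv and rows_from_sql, the table name street and its columns are hypothetical and are not part of process-gnaf.py or gnaffunctions.py, and SQLite stands in for whichever database the G-NAF loader actually targets.

```python
import csv
import io
import sqlite3


def rows_from_csv(text_file):
    """Yield one dict per record from a CSV file object with a header row."""
    yield from csv.DictReader(text_file)


def rows_from_sql(connection, table_name):
    """Yield one dict per record from a table in a DB-API database."""
    cursor = connection.cursor()
    # Table names cannot be bound as parameters, so quote the identifier.
    cursor.execute('SELECT * FROM "%s"' % table_name.replace('"', '""'))
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


# Demonstration with an in-memory database and a made-up table (assumption).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE street (street_pid TEXT, street_name TEXT)")
connection.execute("INSERT INTO street VALUES ('S1', 'MILLER')")

sql_rows = list(rows_from_sql(connection, "street"))
csv_rows = list(rows_from_csv(io.StringIO("street_pid,street_name\nS1,MILLER\n")))
```

Because both generators produce the same kind of record, routines written against one source should run unchanged against the other.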
- Web demo queue handling and e-mail confirmation
The current Web demo can only handle one request at a time, with
other simultaneously incoming requests being rejected (i.e. the
user will get a 'Can not connect to Febrl geocoding
server. Please try again later...'
error message). It is
desirable to add a queue handling system which receives and
handles incoming geocoding requests, stores them in a
first-in-first-out (FIFO) queue, and ideally informs the
user of the estimated time at which their request will be
processed. This becomes important once users start to upload
larger files for geocoding. An e-mail confirmation with an
estimated time of completion of a geocoding job would be a good
solution.
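The queueing idea could be sketched roughly as follows. This is an illustration only, not part of the Febrl Web demo: the class GeocodeJobQueue, its methods and the seconds_per_record estimate are all hypothetical, and the sketch ignores the network and e-mail parts entirely.

```python
from collections import deque


class GeocodeJobQueue:
    """First-in-first-out queue that estimates when each job will start."""

    def __init__(self, seconds_per_record):
        # Assumed average processing cost per record (a made-up parameter).
        self.seconds_per_record = seconds_per_record
        self.jobs = deque()

    def submit(self, job_id, num_records):
        """Queue a job and return the estimated wait in seconds before it starts."""
        records_ahead = sum(count for _, count in self.jobs)
        self.jobs.append((job_id, num_records))
        return records_ahead * self.seconds_per_record

    def next_job(self):
        """Return the oldest waiting job, or None if the queue is empty."""
        return self.jobs.popleft() if self.jobs else None


job_queue = GeocodeJobQueue(seconds_per_record=0.5)
first_wait = job_queue.submit("job-a", 100)  # nothing is ahead of this job
second_wait = job_queue.submit("job-b", 20)  # waits for the 100 records of job-a
```

The value returned by submit is the kind of estimate that could be shown to the user or included in a confirmation e-mail.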
- Street number interpolation
If an exact address can't be found, but the street can be found,
then it should be possible to interpolate between the nearest
street numbers to the one required by fitting a cubic spline to
the available numbers and then taking the point midway along the
spline between the two nearest numbers (a similar technique is
described in the G-NAF Licensee's Guide, see Appendix A,
section A.12.2, Gap Geocoder). Finding the closest street number
neighbours for a given (not available) street number
might be a performance issue. Two cases also remain open: how to
extrapolate when there are not two neighbouring numbers (e.g. the
missing number is at the end of a street), and what to do when a
whole range of numbers is missing (e.g. we want 42 Miller Street and the
closest numbers are 5 - on the other side of the street
- and 98).
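To make the neighbour search concrete, here is a small sketch. It deliberately uses plain linear interpolation between the two nearest known numbers as a simpler stand-in for the cubic spline proposed above, and it assumes that numbers of the same parity lie on the same side of the street. The function name, the shape of the known mapping and the coordinates are made up for illustration.

```python
from bisect import bisect_left


def interpolate_street_number(known, wanted):
    """Estimate (latitude, longitude) for a street number on one street.

    known maps street numbers to (latitude, longitude) pairs.
    Returns None if the number cannot be bracketed by two known numbers.
    """
    # Prefer numbers of the same parity, assumed to share a street side.
    same_side = sorted(n for n in known if n % 2 == wanted % 2)
    numbers = same_side if len(same_side) >= 2 else sorted(known)
    pos = bisect_left(numbers, wanted)  # binary search for the neighbours
    if pos < len(numbers) and numbers[pos] == wanted:
        return known[wanted]  # exact match, nothing to interpolate
    if pos == 0 or pos == len(numbers):
        return None  # end of street: extrapolation is left open here
    low, high = numbers[pos - 1], numbers[pos]
    fraction = (wanted - low) / (high - low)
    (lat1, lon1), (lat2, lon2) = known[low], known[high]
    return (lat1 + fraction * (lat2 - lat1), lon1 + fraction * (lon2 - lon1))


# Made-up coordinates for a single street.
known = {
    5: (-35.2800, 149.1300),
    40: (-35.2810, 149.1310),
    44: (-35.2812, 149.1314),
    98: (-35.2830, 149.1350),
}
estimate = interpolate_street_number(known, 42)   # between 40 and 44
beyond_end = interpolate_street_number(known, 100)  # no upper neighbour
```

Keeping the numbers sorted lets the binary search find both neighbours in logarithmic time, which addresses the performance concern. Note that when fewer than two same-side numbers are known, the sketch falls back to all numbers and may interpolate across the street or across a large gap, so the open questions above still apply.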
- Clustering and outlier detection with many address matches
When more than one address level match is found, the geocoding
engine currently distinguishes between an average match
(if the localities are all located within a user definable small
area), or a many match (if the distances among them
are larger than this small area). Experience has shown that with
several address level matches there is a cluster of them in one
location and then one outlier far away (which prevents the
calculation of an average match). A clustering and outlier
detection step could be applied when several address level matches
are found with large distances between them. If outliers are
detected and removed it might be possible to increase the number
of average address matches.
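A very simple form of such outlier removal is sketched below. It is not a full clustering algorithm: it takes the coordinate-wise median of all matches as the cluster centre, discards matches farther than a threshold from it, and averages the rest. The function name and the max_distance parameter are hypothetical, the input must be non-empty, and coordinates are assumed to be planar (for example projected metres), so that Euclidean distance is meaningful.

```python
from statistics import median


def average_without_outliers(points, max_distance):
    """Average the (x, y) points lying within max_distance of the median centre.

    Returns (average_point, outliers); average_point is None if no point is kept.
    """
    centre_x = median(x for x, _ in points)
    centre_y = median(y for _, y in points)
    kept, outliers = [], []
    for x, y in points:
        distance = ((x - centre_x) ** 2 + (y - centre_y) ** 2) ** 0.5
        (kept if distance <= max_distance else outliers).append((x, y))
    if not kept:
        return None, outliers
    average = (sum(x for x, _ in kept) / len(kept),
               sum(y for _, y in kept) / len(kept))
    return average, outliers


# Three nearby matches and one far away match (made-up planar coordinates).
matches = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0), (500.0, 500.0)]
average, outliers = average_without_outliers(matches, max_distance=10.0)
```

The median is used for the centre because, unlike the mean, it is barely shifted by a single distant match. Whether a fixed threshold is adequate, or a proper clustering method is needed, would have to be decided on real data.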
Release 0.3.1, documentation updated on July 1, 2005.