You've come to Doug Aberdeen's old pages. In 5 seconds you will be taken to my new pages: http://sml.nicta.com.au/~daa/
(This assignment suggestion comes from Vishy)
Carefully read the paper
which describes a method to speed up the k-means clustering algorithm by using
properties of a metric. To complete this assignment you are expected to:
- Implement the algorithm in some not too obscure language
- Provide a simple makefile if it needs to be compiled
- Provide enough basic comments that I should be able to quickly figure out which part
of the code does each part of the algorithm
- Implement both the original k-means algorithm and
the improved k-means algorithm
- The program should take the following switches (or matlab variables)
- -i [string] input file (stdin by default)
- -o [string] output file (stdout by default)
- -k [int] number of clusters
- -a run the original k-means algorithm
- -b run the improved k-means algorithm
- -c run both the algorithms
- The input data is provided as a simple text file: the first line is the
number of data points (int), the second line is the number of dimensions
per data point (int), and each line from the third onwards contains one
data point. See an example here
- You can safely assume that the input data is sane (don't waste your time
building in complicated error checks)
- The expected output contains the following
- Input File: [string]
- Number of Clusters: [int]
- Algorithm: [original/improved]
- Time:
- Number of distance computations:
- Cluster Centers:
- If you are running both the algorithms then the summary must be provided
first for the original algorithm and then for the improved algorithm.
- Two data files are provided: test.txt.bz2 and train.txt.bz2. Because we are
clustering, there is no real difference between the training and testing sets; just think
of them as two data sets of different sizes to cluster.
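
To make the requirements above concrete, here is a minimal Python sketch of the plumbing: the switches, the input format, a distance function that counts its own calls, the original k-means loop, and the summary. It is only one possible layout, not a reference solution, and it leaves out the improved algorithm entirely. Several details are my assumptions rather than part of the assignment: values on a data line are taken to be whitespace-separated, -k is treated as required, the time is reported in seconds, centers are printed one per line, the initial centers are k randomly sampled points, and all function and class names are made up for illustration.

```python
import argparse
import io
import math
import random
import time


def build_parser():
    """Switches from the assignment; -i/-o stay None to mean stdin/stdout."""
    p = argparse.ArgumentParser(description="k-means assignment sketch")
    p.add_argument("-i", metavar="FILE", help="input file (default: stdin)")
    p.add_argument("-o", metavar="FILE", help="output file (default: stdout)")
    # Assumption: -k is mandatory; the assignment does not give a default.
    p.add_argument("-k", type=int, required=True, help="number of clusters")
    p.add_argument("-a", action="store_true", help="run original k-means")
    p.add_argument("-b", action="store_true", help="run improved k-means")
    p.add_argument("-c", action="store_true", help="run both algorithms")
    return p


def read_points(stream):
    """Read: point count, dimension count, then one point per line.

    Assumption: coordinates on a line are separated by whitespace.
    """
    n = int(stream.readline())
    dim = int(stream.readline())
    points = [[float(v) for v in stream.readline().split()] for _ in range(n)]
    return points, dim


class CountingDistance:
    """Euclidean distance that records how many times it was evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, p, q):
        self.calls += 1
        return math.dist(p, q)  # requires Python 3.8+


def kmeans(points, k, dist, max_iter=100, seed=0):
    """Original k-means: assign to nearest center, recompute means, repeat."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def format_summary(input_name, k, algorithm, seconds, n_dist, centers):
    """Build the summary in the order listed in the assignment."""
    lines = [
        f"Input File: {input_name}",
        f"Number of Clusters: {k}",
        f"Algorithm: {algorithm}",
        f"Time: {seconds:.6f} s",  # assumption: seconds
        f"Number of distance computations: {n_dist}",
        "Cluster Centers:",
    ]
    lines += [" ".join(f"{v:g}" for v in c) for c in centers]
    return "\n".join(lines)


# Tiny in-memory run on made-up data (not one of the provided files).
sample = io.StringIO("4\n2\n0 0\n0 1\n10 10\n10 11\n")
points, dim = read_points(sample)
dist = CountingDistance()
start = time.perf_counter()
centers = kmeans(points, 2, dist)
elapsed = time.perf_counter() - start
print(format_summary("<memory>", 2, "original", elapsed, dist.calls, centers))
```

Routing every distance evaluation through one counting object means the improved algorithm can reuse the same counter, so the two "Number of distance computations" figures are directly comparable. In a real program you would call `build_parser().parse_args()` and open the files named by `-i` and `-o` in place of the in-memory sample.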