You've come to Doug Aberdeen's old pages. In 5 seconds you will be taken to my new pages: http://sml.nicta.com.au/~daa/

(This assignment suggestion comes from Vishy)

Carefully read the paper, which describes a method to speed up the k-means clustering algorithm by using properties of a metric. To complete this assignment, you are expected to:

  • Implement the algorithm in some not too obscure language
  • Provide a simple makefile if it needs to be compiled
  • Provide enough basic comments that I can quickly figure out which part of the code does each part of the algorithm
  • Implement the original k-means algorithm as well as the improved k-means algorithm
  • Make the program take the following switches (or MATLAB variables):
    • -i [string] input file (stdin by default)
    • -o [string] output file (stdout by default)
    • -k [int] number of clusters
    • -a run the original k-means algorithm
    • -b run the improved k-means algorithm
    • -c run both algorithms
  • The input data is provided as a simple text file: the first line is the number of data points (int), the second line is the number of dimensions per data point (int), and each line from the third onwards contains one data point. See an example here
  • You can safely assume that the input data is sane (don't waste your time building in complicated error checks)
  • The expected output contains the following:
    • Input File: [string]
    • Number of Clusters: [int]
    • Algorithm: [original/improved]
    • Time:
    • Number of distance computations
    • Cluster Centers:
  • If you run both algorithms, provide the summary for the original algorithm first and then for the improved algorithm.
  • Two data files are provided: test.txt.bz2 and train.txt.bz2. Because we are clustering, there is no real difference between the training and testing sets; just think of them as two data sets of different sizes to cluster.
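
To make the requirements above concrete, here is a minimal Python sketch of the plumbing: the switches, the input reader, the original k-means algorithm with a distance counter, and the summary. It is only one possible layout, not a reference solution. Several things in it are assumptions rather than part of the assignment: the function names are made up for illustration, values on a data line are taken to be whitespace-separated, the first k points are used as initial centers, the time is reported in seconds, and a colon is added after "Number of distance computations". The improved algorithm from the paper is deliberately not sketched.

```python
import argparse
import sys
import time


def parse_args(argv=None):
    """Parse the switches from the assignment: -i, -o, -k, and one of -a/-b/-c."""
    p = argparse.ArgumentParser(description="k-means assignment driver (sketch)")
    p.add_argument("-i", dest="infile", default=None, help="input file (stdin by default)")
    p.add_argument("-o", dest="outfile", default=None, help="output file (stdout by default)")
    p.add_argument("-k", dest="k", type=int, required=True, help="number of clusters")
    mode = p.add_mutually_exclusive_group(required=True)
    mode.add_argument("-a", dest="mode", action="store_const", const="original")
    mode.add_argument("-b", dest="mode", action="store_const", const="improved")
    mode.add_argument("-c", dest="mode", action="store_const", const="both")
    return p.parse_args(argv)


def read_points(lines):
    """Line 1: number of points. Line 2: dimensions. Then one point per line.

    Assumption: coordinates on a line are separated by whitespace.
    """
    it = iter(lines)
    n = int(next(it))
    d = int(next(it))
    points = [[float(x) for x in next(it).split()] for _ in range(n)]
    return points, d


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_original(points, k, max_iter=100):
    """Plain k-means; returns (centers, number of distance computations)."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assign = [-1] * len(points)
    n_dist = 0
    for _ in range(max_iter):
        changed = False
        # Assignment step: every point is compared against every center.
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            n_dist += k
            best = min(range(k), key=dists.__getitem__)
            if best != assign[i]:
                assign[i] = best
                changed = True
        if not changed:
            break
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, n_dist


def report(out, infile, k, algorithm, seconds, n_dist, centers):
    """Write the summary fields in the order the assignment lists them."""
    out.write(f"Input File: {infile}\n")
    out.write(f"Number of Clusters: {k}\n")
    out.write(f"Algorithm: {algorithm}\n")
    out.write(f"Time: {seconds:.6f}\n")  # assumption: seconds
    out.write(f"Number of distance computations: {n_dist}\n")
    out.write("Cluster Centers:\n")
    for c in centers:
        out.write(" ".join(str(x) for x in c) + "\n")


def main(argv=None):
    args = parse_args(argv)
    src = open(args.infile) if args.infile else sys.stdin
    points, _ = read_points(src)
    out = open(args.outfile, "w") if args.outfile else sys.stdout
    if args.mode in ("original", "both"):
        start = time.perf_counter()
        centers, n_dist = kmeans_original(points, args.k)
        elapsed = time.perf_counter() - start
        report(out, args.infile or "stdin", args.k, "original", elapsed, n_dist, centers)
    if args.mode in ("improved", "both"):
        # Placeholder: the paper's improved algorithm goes here.
        raise NotImplementedError("improved k-means is not part of this sketch")
```

Assuming test.txt.bz2 has been decompressed to test.txt, calling `main(["-i", "test.txt", "-k", "3", "-a"])` would print the summary for the original algorithm to stdout. Counting distance computations inside the assignment step keeps the counter comparable once an improved variant skips some of those comparisons.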

The views and opinions expressed on this web page are not necessarily those of NICTA or the Australian National University. Any HTML or image from this page may be copied and re-used freely but must not be sold.
Feedback: Doug.Aberdeen AT anu.edu.au