You've come to Doug Aberdeen's old pages. In 5 seconds you will be taken to my new pages: http://sml.nicta.com.au/~daa/

GSAC6017 : Overview of Statistical Machine Learning

The 2004 course begins October 11 and will run for 4 weeks.

Here is the syllabus.

Schedule of remaining lectures


  1. Stochastic gradient descent (Theory)
  2. Fast K-Means (Programming)
  3. Graphical Models (Theory)
  4. Reinforcement Learning (Programming)


  1. First half, Second half
  2. Full lecture
  3. First half, Second half
  4. Full lecture
  5. First half, Second half
  6. First half (slides not covered in lecture 5), Second half
  7. Part 1, Part 2, Part 3 (Presented by Alex Smola; RSISE, Room 207)
  8. Same slides as lecture 7. Lecture at Northbourne Ave.
  9. Full lecture (up to slide 71): slides from Chris Bishop
  10. First half, finish slides from Chris Bishop, second half
  11. Full lecture (same as lecture 10, part 2, but corrected and expanded)
  12. Full lecture

Required Reading

Recommended Reading


  1. Genetic Programming: Efficient dense matrix multiplication is important for many machine learning algorithms. However, the best algorithm for multiplying matrices is very specific to each memory architecture. Typically, matrices A (M*K) and B (K*N) are split into sub-blocks of size (m*k) and (k*n), and a block of each matrix is kept in L1 cache. It is important to order the blocks to minimise external memory accesses. This is non-trivial because a good ordering and cache use for the A and B sub-blocks may not result in good cache use for the results in C (M*N).

    The best auto-optimising codes do a grid search of the parameter space, which is very slow. Sketch a representation of a matrix-matrix multiplication (AB=C) blocking algorithm with parameters controlling the size and traversal order of sub-blocks in matrices A, B and C. Assume an inbuilt dense MM algorithm which will correctly compute part of the result given sub-blocks of A, B and C. Consider a fitness function to test the algorithm. Sketch out cross-over and mutation operators. As far as I know, this approach has never been tried before. Given the success of EAs for FFT codes, which share some properties of matrix multiplication, it could be worth pursuing if you think you have a good design.
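    One possible shape for such an algorithm, sketched in Python. Everything here is an illustrative assumption: the problem size, a chromosome of just three block dimensions (a fuller design would also encode the i0/j0/l0 traversal order the exercise asks for), truncation selection, and raw wall-clock time as the fitness.

    ```python
    import random
    import time

    # Illustrative problem size; a real tuner would target much larger matrices.
    M, K, N = 48, 48, 48
    A = [[random.random() for _ in range(K)] for _ in range(M)]
    B = [[random.random() for _ in range(N)] for _ in range(K)]

    def blocked_matmul(m, k, n):
        """C = A*B using (m*k) blocks of A and (k*n) blocks of B."""
        C = [[0.0] * N for _ in range(M)]
        for i0 in range(0, M, m):
            for j0 in range(0, N, n):
                for l0 in range(0, K, k):
                    for i in range(i0, min(i0 + m, M)):
                        Ai, Ci = A[i], C[i]
                        for l in range(l0, min(l0 + k, K)):
                            a, Bl = Ai[l], B[l]
                            for j in range(j0, min(j0 + n, N)):
                                Ci[j] += a * Bl[j]
        return C

    def fitness(genes):
        """Wall-clock time of one multiply: lower is fitter."""
        start = time.perf_counter()
        blocked_matmul(*genes)
        return time.perf_counter() - start

    def crossover(p1, p2):
        """Uniform crossover: each block dimension from either parent."""
        return tuple(random.choice(pair) for pair in zip(p1, p2))

    def mutate(genes):
        """Nudge one randomly chosen block dimension, clamped to [1, 48]."""
        g = list(genes)
        i = random.randrange(len(g))
        g[i] = max(1, min(48, g[i] + random.choice((-8, 8))))
        return tuple(g)

    # Tiny evolutionary loop: truncation selection over a population of 6.
    pop = [tuple(random.choice((8, 16, 32)) for _ in range(3)) for _ in range(6)]
    for _ in range(3):
        pop.sort(key=fitness)
        parents = pop[:3]
        pop = parents + [mutate(crossover(random.choice(parents),
                                          random.choice(parents)))
                         for _ in range(3)]
    best = min(pop, key=fitness)
    ```

    A serious fitness function would average several timings (or read hardware cache counters) rather than trust a single noisy measurement.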

  2. Gradient descent: Recall the Pr[rain|w1,w2] = w1 f1 + w2 f2 example, where w1 and w2 are scalar parameters and f1 and f2 are input features. f1 = 1 if it's cloudy and 0 otherwise. f2 = 1 if it's spring, and 0 otherwise. Four example instances are:

    Cloudy, spring, Pr[rain] = 0.7
    Cloudy, other, Pr[rain] = 0.5
    Fine, spring, Pr[rain] = 0.2
    Fine, other, Pr[rain] = 0.1

    Use gradient descent to find the w1 and w2 parameters that minimise the mean square error: E(w1, w2) = 0.5 \sum_examples (Pr[rain] - Pr[rain|w1,w2])^2. Use one gradient calculation only, performing a line search to find roughly the best step size. If the last example were Pr[rain] = 0.0, the weights would clearly be w1 = 0.5 and w2 = 0.2, for 0 error. The last sample is thus an example of noise in the training data.
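    A minimal sketch of that procedure (the grid of candidate step sizes is an arbitrary assumption; any rough line search will do). Starting from w1 = w2 = 0 the gradient is (-1.2, -0.9), and the line search only has to fix the step size along that single direction.

    ```python
    # Examples as (f1 = cloudy?, f2 = spring?, target Pr[rain]).
    data = [(1, 1, 0.7), (1, 0, 0.5), (0, 1, 0.2), (0, 0, 0.1)]

    def error(w1, w2):
        """E(w1, w2) = 0.5 * sum (Pr[rain] - Pr[rain|w1,w2])^2."""
        return 0.5 * sum((t - (w1 * f1 + w2 * f2)) ** 2 for f1, f2, t in data)

    def gradient(w1, w2):
        g1 = -sum((t - (w1 * f1 + w2 * f2)) * f1 for f1, f2, t in data)
        g2 = -sum((t - (w1 * f1 + w2 * f2)) * f2 for f1, f2, t in data)
        return g1, g2

    w1 = w2 = 0.0
    g1, g2 = gradient(w1, w2)          # the single gradient calculation
    # Crude line search along -gradient over a grid of step sizes.
    best_eta = min((s / 10 for s in range(1, 21)),
                   key=lambda eta: error(w1 - eta * g1, w2 - eta * g2))
    w1, w2 = w1 - best_eta * g1, w2 - best_eta * g2
    ```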

  3. Several useful exercises along with worked solutions covering Naive Bayes, Bayesian Networks, and Reinforcement Learning.

  4. Use Fisher's Discriminant Function to find a function h(x) = sgn(w.x + b) that separates the AND function, with training points: (x_1 = (0,0), c_1 = 0), (x_2 = (1,0), c_2 = 0), (x_3 = (0,1), c_3 = 0), (x_4 = (1,1), c_4 = 1). You will have to choose an appropriate threshold b.
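    One way to check an answer, as a pure-Python sketch of w = S_W^{-1}(m_1 - m_0); taking b as the midpoint between the closest class projections is one reasonable choice, not the only one:

    ```python
    pts = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]

    def mean(xs):
        return [sum(c) / len(xs) for c in zip(*xs)]

    c0 = [x for x, c in pts if c == 0]
    c1 = [x for x, c in pts if c == 1]
    m0, m1 = mean(c0), mean(c1)

    # Within-class scatter S_W: sum over both classes of (x - m)(x - m)^T.
    S = [[0.0, 0.0], [0.0, 0.0]]
    for xs, m in ((c0, m0), (c1, m1)):
        for x in xs:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    S[i][j] += d[i] * d[j]

    # Fisher direction w = S_W^{-1} (m1 - m0), via the explicit 2x2 inverse.
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [(S[1][1] * diff[0] - S[0][1] * diff[1]) / det,
         (-S[1][0] * diff[0] + S[0][0] * diff[1]) / det]

    # Threshold: midway between the closest projections of the two classes.
    proj0 = max(w[0] * x[0] + w[1] * x[1] for x in c0)
    proj1 = min(w[0] * x[0] + w[1] * x[1] for x in c1)
    b = -(proj0 + proj1) / 2

    def h(x):
        return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
    ```

    On this data the direction comes out proportional to (1, 1), as symmetry of the AND points suggests it should.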

  5. Re-derive the error back-propagation algorithm using the tanh function as the squashing function and using the log probability loss function: E = - log Pr(c_k = t | x). That is, maximise the probability that the true output class is t given the input x. The output Pr(c_k | x) is given by the "soft-max" function which converts the real valued NN output into a well behaved probability distribution:
    Pr(c_k | x) = exp(z_k) / ( \sum_k' exp(z_k') )
    where the z_k are the multi-layer perceptron outputs. Useful identities:
    d Pr(c_k' | x) / d z_k = Pr(c_k' | x) ( \delta(k,k') - Pr(c_k | x) )
    where \delta(i,j) is 1 if i=j and 0 otherwise;
    y_j = tanh(a_j)
    d tanh(a_j) / d a_j = 1 - y_j^2
    J = - log Pr(c_k | x)
    d J / d Pr(c_k | x) = - 1 / Pr(c_k | x)
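    Chaining these identities gives the familiar soft-max output error dJ/dz_k = Pr(c_k|x) - \delta(k,t). A small numerical sanity check of the resulting back-propagation is sketched below; the layer sizes, input, and random weights are arbitrary choices for illustration.

    ```python
    import math
    import random

    random.seed(0)

    # Tiny MLP: 2 inputs -> 3 tanh hidden units -> 2 soft-max outputs.
    NI, NH, NO = 2, 3, 2
    W1 = [[random.uniform(-1, 1) for _ in range(NI)] for _ in range(NH)]
    W2 = [[random.uniform(-1, 1) for _ in range(NH)] for _ in range(NO)]

    def forward(x):
        a = [sum(W1[j][i] * x[i] for i in range(NI)) for j in range(NH)]
        y = [math.tanh(aj) for aj in a]                 # y_j = tanh(a_j)
        z = [sum(W2[k][j] * y[j] for j in range(NH)) for k in range(NO)]
        mx = max(z)                                     # stabilised soft-max
        ez = [math.exp(zk - mx) for zk in z]
        p = [e / sum(ez) for e in ez]
        return y, p

    def gradients(x, t):
        """Back-propagate J = -log Pr(c_t | x) through both weight layers."""
        y, p = forward(x)
        dz = [p[k] - (1 if k == t else 0) for k in range(NO)]
        dW2 = [[dz[k] * y[j] for j in range(NH)] for k in range(NO)]
        # Hidden error: (1 - y_j^2) is d tanh(a_j) / d a_j.
        da = [(1 - y[j] ** 2) * sum(W2[k][j] * dz[k] for k in range(NO))
              for j in range(NH)]
        dW1 = [[da[j] * x[i] for i in range(NI)] for j in range(NH)]
        return dW1, dW2
    ```

    Comparing these analytic gradients against central finite differences of J is a quick way to confirm a derivation before trusting it in a training loop.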

The views and opinions expressed on this web page are not necessarily those of NICTA or the Australian National University. Any HTML or image from this page may be copied and re-used freely but must not be sold.
Feedback: Doug.Aberdeen AT anu.edu.au