
You've come to Doug Aberdeen's old pages. In 5 seconds you will be taken to my new pages: http://sml.nicta.com.au/~daa/
GSAC6017 : Overview of Statistical Machine Learning
The 2004 course begins October 11 and will run for 4 weeks.
Here is the syllabus.
Assignments
Stochastic gradient descent (Theory)
Fast KMeans (Programming)
Graphical Models (Theory)
Reinforcement Learning (Programming)
Slides
Lecture 1: First half, Second half
Lecture 2: Full lecture
Lecture 3: First half, Second half
Lecture 4: Full lecture
Lecture 5: First half, Second half
Lecture 6: First half (slides not covered in lecture 5), Second half
Lecture 7: Part 1, Part 2, Part 3 (presented by Alex Smola; RSISE, Room 207)
Lecture 8: Same slides as lecture 7. Lecture at Northbourne Ave.
Lecture 9: Full lecture (up to slide 71); slides from Chris Bishop
Lecture 10: First half, finish slides from Chris Bishop, second half
Lecture 11: Full lecture (same as lecture 10, part 2, but corrected and expanded)
Lecture 12: Full lecture
Required Reading
Recommended Reading
Exercises

Genetic Programming: Efficient dense matrix multiplication is important for
many machine learning algorithms. However, the best algorithm for multiplying matrices
is very specific to each memory architecture. Typically, matrices A (M*K) and B (K*N)
are split into subblocks of size (m*k) and (k*n), and a block of each matrix is
kept in L1 cache.
It is important to order the
blocks to minimise external memory accesses. This is nontrivial because a good ordering
and cache use of the A and B subblocks may not result in good cache use for the results in C (M*N).
The best auto-optimising codes do a grid search of the parameter space, which is very slow.
Sketch a representation of a matrix-matrix multiplication (AB=C) blocking algorithm
with parameters controlling the size and traversal order of subblocks in matrices A, B and C.
Assume an inbuilt dense MM algorithm which will correctly compute part of the result given
subblocks of A, B and C. Consider a fitness function to test the algorithm. Sketch out crossover and mutation operators.
As far as I know, this approach has never been tried before. Given the success of EAs for
FFT codes, which share some properties of matrix multiplication, it could be worth pursuing
if you think you have a good design.
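Here is a minimal sketch of one possible representation and fitness function, just to make the exercise concrete. The chromosome (block sizes plus a loop-order gene), the candidate block sizes, the timing-based fitness, and the use of numpy for the inner block product are all illustrative choices of mine, not part of the exercise:

import random
import time
import numpy as np

# Hypothetical chromosome: block sizes (mb, kb, nb) and the order in which the
# three blocking loops are nested.  All names and values here are illustrative.
LOOP_ORDERS = ["ijk", "ikj", "jik", "jki", "kij", "kji"]

def random_individual():
    return {
        "mb": random.choice([16, 32, 64, 128]),
        "kb": random.choice([16, 32, 64, 128]),
        "nb": random.choice([16, 32, 64, 128]),
        "order": random.choice(LOOP_ORDERS),
    }

def blocked_matmul(A, B, ind):
    """Blocked A @ B using the individual's block sizes and loop order.
    The innermost block product stands in for the 'inbuilt dense MM' routine."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    mb, kb, nb = ind["mb"], ind["kb"], ind["nb"]
    ranges = {"i": range(0, M, mb), "j": range(0, N, nb), "k": range(0, K, kb)}
    o = ind["order"]
    for a in ranges[o[0]]:
        for b in ranges[o[1]]:
            for c in ranges[o[2]]:
                idx = {o[0]: a, o[1]: b, o[2]: c}
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i:i+mb, j:j+nb] += A[i:i+mb, k:k+kb] @ B[k:k+kb, j:j+nb]
    return C

def fitness(ind, A, B):
    """Fitness = negative wall-clock time (shorter is fitter); correctness is
    checked against a reference product."""
    t0 = time.perf_counter()
    C = blocked_matmul(A, B, ind)
    elapsed = time.perf_counter() - t0
    assert np.allclose(C, A @ B)
    return -elapsed

def crossover(p1, p2):
    """Uniform crossover: each gene is taken from either parent."""
    return {g: random.choice([p1[g], p2[g]]) for g in p1}

def mutate(ind, rate=0.25):
    """Mutation: resample each gene with a small probability."""
    fresh = random_individual()
    return {g: (fresh[g] if random.random() < rate else v) for g, v in ind.items()}

if __name__ == "__main__":
    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    pop = [random_individual() for _ in range(8)]
    for gen in range(5):
        pop.sort(key=lambda ind: fitness(ind, A, B), reverse=True)
        parents = pop[:4]
        pop = parents + [mutate(crossover(random.choice(parents),
                                          random.choice(parents)))
                         for _ in range(4)]
    pop.sort(key=lambda ind: fitness(ind, A, B), reverse=True)
    print("best individual:", pop[0])

A real attempt would need cache-aware traversal within blocks and much more careful timing, but the skeleton of representation, fitness, crossover and mutation is the same.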

Gradient descent: Recall the Pr[rain | w1, w2] = w1 f1 + w2 f2 example, where w1 and w2 are
scalar parameters and f1 and f2 are input features. f1 = 1 if it's cloudy and 0 otherwise.
f2 = 1 if it's spring, and 0 otherwise. Four example instances are:
Cloudy, spring, Pr[rain] = 0.7
Cloudy, other, Pr[rain] = 0.5
Fine, spring, Pr[rain] = 0.2
Fine, other, Pr[rain] = 0.1
Use gradient descent to find the w1 and w2 parameters that minimise the mean square error:
E(w1, w2) = 0.5 \sum_examples (Pr[rain] - Pr[rain | w1, w2])^2. Use one gradient calculation only,
performing a line search to find roughly the best step size. If the last example were
Pr[rain] = 0.0, the weights would clearly be w1=0.5 and w2=0.2, for 0 error. The last sample
is an example of noise in the inputs.
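A small sketch of one way to carry out the single gradient step with a rough line search; the starting point w = (0, 0) and the step-size grid are arbitrary choices of mine:

import numpy as np

# The four training examples from the exercise: features (f1 = cloudy, f2 = spring)
# and the target Pr[rain].
X = np.array([[1.0, 1.0],   # cloudy, spring
              [1.0, 0.0],   # cloudy, other
              [0.0, 1.0],   # fine, spring
              [0.0, 0.0]])  # fine, other
t = np.array([0.7, 0.5, 0.2, 0.1])

def error(w):
    """Mean square error E(w1, w2) = 0.5 * sum (Pr[rain] - Pr[rain|w])^2."""
    return 0.5 * np.sum((t - X @ w) ** 2)

def gradient(w):
    """dE/dw = -X^T (t - X w)."""
    return -X.T @ (t - X @ w)

# One gradient calculation at the starting point, then a crude line search
# over candidate step sizes.
w0 = np.zeros(2)
g = gradient(w0)
steps = np.linspace(0.0, 2.0, 41)
errors = [error(w0 - s * g) for s in steps]
best = steps[int(np.argmin(errors))]
w1 = w0 - best * g

print("gradient at w0:", g)
print("roughly best step size:", best)
print("weights after one step:", w1, "error:", error(w1))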

Several useful exercises along with worked solutions
covering Naive Bayes, Bayesian Networks, and Reinforcement Learning.

Use Fisher's Discriminant Function to find a function h(x) = sgn(w.x + b) that separates the AND function, with training points: (x_1 = (0,0), c_1 = 0), (x_2 = (1,0), c_2 = 0), (x_3 = (0,1), c_3 = 0), (x_4 = (1,1), c_4 = 1). You will have to choose an appropriate threshold b.
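A possible worked sketch, placing the threshold b halfway between the closest projections of the two classes (one reasonable choice; the exercise leaves the choice of b to you):

import numpy as np

# AND training data from the exercise
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = np.array([0, 0, 0, 1])

X0, X1 = X[c == 0], X[c == 1]
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter matrix S_w = sum over classes of sum_x (x - m)(x - m)^T
Sw = sum(np.outer(x - m0, x - m0) for x in X0) + \
     sum(np.outer(x - m1, x - m1) for x in X1)

# Fisher direction: w is proportional to S_w^{-1} (m1 - m0); pinv guards against
# a singular scatter matrix.
w = np.linalg.pinv(Sw) @ (m1 - m0)

# Threshold halfway between the closest projections of the two classes.
b = -0.5 * (X0 @ w).max() - 0.5 * (X1 @ w).min()

h = np.sign(X @ w + b)
print("w =", w, "b =", b)
print("h(x) for all points:", h)   # expect [-1, -1, -1, +1]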

Rederive the error backpropagation algorithm using the tanh function as the squashing
function and using the log probability loss function: E = -log Pr(c_k = t | x).
That is, maximise the probability that the true output class is t given the input x.
The output Pr(c_k | x) is given by the "softmax" function, which converts the real-valued
NN outputs into a well-behaved probability distribution:
Pr(c_k | x) = exp(z_k) / ( \sum_k' exp(z_k') )
where the z_k's are the multilayer perceptron outputs.
Useful identities:
d Pr(c_k' | x) / d z_k = Pr(c_k' | x) ( \delta(k,k') - Pr(c_k | x) )
where \delta(i,j) is 1 if i=j and 0 otherwise;
y_j = tanh(a_j)
d tanh(a_j) / d a_j = 1 - y_j^2
J = -log Pr(c_k | x)
d J / d Pr(c_k | x) = -1 / Pr(c_k | x)
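Once you have a derivation, a finite-difference check is a quick way to test it. The sketch below assumes a one-hidden-layer tanh network and uses the output-layer result the identities above lead to, dJ/dz_k = Pr(c_k | x) - \delta(k, t); the network sizes and random seed are arbitrary choices of mine:

import numpy as np

rng = np.random.default_rng(0)

# A tiny MLP: input -> tanh hidden layer -> linear outputs z_k -> softmax.
W1 = rng.normal(size=(4, 3))   # hidden weights (3 inputs -> 4 hidden units)
W2 = rng.normal(size=(2, 4))   # output weights (4 hidden -> 2 classes)
x = rng.normal(size=3)
t_class = 1                    # true class t

def forward(W1, W2, x):
    a = W1 @ x                 # hidden activations a_j
    y = np.tanh(a)             # y_j = tanh(a_j)
    z = W2 @ y                 # MLP outputs z_k
    p = np.exp(z - z.max())
    p /= p.sum()               # Pr(c_k | x) = softmax(z)
    return a, y, z, p

def loss(W1, W2, x, t_class):
    *_, p = forward(W1, W2, x)
    return -np.log(p[t_class])  # J = -log Pr(c_t | x)

# Backward pass: dJ/dz_k = Pr(c_k | x) - delta(k, t), then chain back through
# the tanh layer using d tanh(a_j)/d a_j = 1 - y_j^2.
a, y, z, p = forward(W1, W2, x)
dz = p.copy()
dz[t_class] -= 1.0
dW2 = np.outer(dz, y)
dy = W2.T @ dz
da = dy * (1.0 - y ** 2)
dW1 = np.outer(da, x)

# Finite-difference check of dW1 (the same idea works for dW2).
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp, W2, x, t_class) - loss(Wm, W2, x, t_class)) / (2 * eps)

print("max |analytic - numerical| for dW1:", np.abs(dW1 - num).max())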
