You've come to Doug Aberdeen's old pages. In 5 seconds you will be taken to my new pages: http://sml.nicta.com.au/~daa/
GSAC6017 : Overview of Statistical Machine Learning
The 2004 course begins October 11 and will run for 4 weeks.
Here is the syllabus.
Assignments
- Stochastic gradient descent (Theory)
- Fast K-Means (Programming)
- Graphical Models (Theory)
- Reinforcement Learning (Programming)
Slides
- Lecture 1: First half, Second half
- Lecture 2: Full lecture
- Lecture 3: First half, Second half
- Lecture 4: Full lecture
- Lecture 5: First half, Second half
- Lecture 6: First half (slides not covered in lecture 5), Second half
- Lecture 7: Part 1, Part 2, Part 3 (presented by Alex Smola; RSISE, Room 207)
- Lecture 8: Same slides as lecture 7. Lecture at Northbourne Ave.
- Lecture 9: Full lecture (up to slide 71), slides from Chris Bishop
- Lecture 10: First half, finishing the slides from Chris Bishop; second half
- Lecture 11: Full lecture (same as lecture 10, part 2, but corrected and expanded)
- Lecture 12: Full lecture
Required Reading
Recommended Reading
Exercises
- Genetic Programming: Efficient dense matrix multiplication is important for
many machine learning algorithms. However, the best algorithm for multiplying matrices
is very specific to each memory architecture. Typically, matrices A (M*K) and B (K*N)
are split into sub-blocks of size (m*k) and (k*n), and a block of each matrix is
kept in L1 cache. It is important to order the
blocks to minimise external memory accesses. This is non-trivial because an ordering
that uses the cache well for the A and B sub-blocks may not use it well for the result
blocks of C (M*N).
The best auto-optimising codes do a grid search of the parameter space, which is very slow.
Sketch a representation of a matrix-matrix multiplication (AB=C) blocking algorithm
with parameters controlling the size and traversal order of sub-blocks in matrices A, B and C.
Assume an inbuilt dense MM algorithm that will correctly compute part of the result given
sub-blocks of A, B and C. Consider a fitness function to test the algorithm. Sketch out
cross-over and mutation operators. (A sketch of a parameterised blocking scheme is given below.)
As far as I know, this approach has never been tried before. Given the success of EAs for
FFT codes, which share some properties of matrix multiplication, it could be worth pursuing
if you think you have a good design.
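For concreteness, here is a minimal Python sketch (my own addition, not from the course)
of the kind of parameterised blocking scheme the exercise asks you to represent. The genome
here is just the block sizes (m, k, n) with a fixed row-major traversal; a full GP solution
would also encode the traversal order, and the cross-over and mutation operators are left to you.

    # Sketch: parameterised blocked matrix multiplication and a runtime fitness.
    import time
    import numpy as np

    def blocked_matmul(A, B, m, k, n):
        """Compute C = A @ B by iterating over (m*k) and (k*n) sub-blocks."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        C = np.zeros((M, N))
        # Simple row-major traversal of blocks; a GP individual could
        # encode a different loop order or tiling here.
        for i in range(0, M, m):
            for j in range(0, N, n):
                for p in range(0, K, k):
                    # The "inbuilt dense MM" applied to sub-blocks:
                    C[i:i+m, j:j+n] += A[i:i+m, p:p+k] @ B[p:p+k, j:j+n]
        return C

    def fitness(genome, A, B, C_ref):
        """Negative runtime; incorrect individuals get -inf."""
        m, k, n = genome
        start = time.perf_counter()
        C = blocked_matmul(A, B, m, k, n)
        elapsed = time.perf_counter() - start
        if not np.allclose(C, C_ref):
            return float("-inf")
        return -elapsed

    # Example: evaluate one individual.
    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    C_ref = A @ B
    print(fitness((32, 64, 32), A, B, C_ref))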
- Gradient descent: Recall the Pr[rain|w1,w2] = w1 f1 + w2 f2 example, where w1 and w2 are
scalar parameters and f1 and f2 are input features. f1 = 1 if it's cloudy and 0 otherwise;
f2 = 1 if it's spring, and 0 otherwise. Four example instances are:
Cloudy, spring, Pr[rain] = 0.7
Cloudy, other, Pr[rain] = 0.5
Fine, spring, Pr[rain] = 0.2
Fine, other, Pr[rain] = 0.1
Use gradient descent to find the w1 and w2 parameters that minimise the mean square error:
E(w1, w2) = 0.5 \sum_examples (Pr[rain] - Pr[rain|w1,w2])^2. Use one gradient calculation only,
performing a line search to find roughly the best step size; a numerical sketch follows below.
If the last example were Pr[rain] = 0.0, the weights would clearly be w1 = 0.5 and w2 = 0.2,
for zero error. The last sample is an example of noise in the training data.
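Here is a numerical sketch of that procedure in Python. The starting point w = (0, 0) and the
grid-based line search are my own assumptions; the exercise specifies neither.

    import numpy as np

    # Examples as feature rows (f1, f2) and targets Pr[rain].
    F = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
    y = np.array([0.7, 0.5, 0.2, 0.1])

    def error(w):
        """E(w) = 0.5 * sum of squared residuals over the four examples."""
        return 0.5 * np.sum((y - F @ w) ** 2)

    w = np.zeros(2)                    # assumed starting point
    grad = -F.T @ (y - F @ w)          # dE/dw, computed once only
    direction = -grad                  # descent direction

    # Rough line search: try a grid of step sizes, keep the best.
    steps = np.linspace(0.0, 1.0, 101)
    best = min(steps, key=lambda a: error(w + a * direction))
    w = w + best * direction
    print(f"step size ~ {best:.2f}, w = {w}, E = {error(w):.4f}")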
- Several useful exercises, along with worked solutions,
covering Naive Bayes, Bayesian Networks, and Reinforcement Learning.
- Use Fisher's Discriminant Function to find a function h(x) = sgn(w.x + b) that separates
the AND function, with training points: (x_1 = (0,0), c_1 = 0), (x_2 = (1,0), c_2 = 0),
(x_3 = (0,1), c_3 = 0), (x_4 = (1,1), c_4 = 1). You will have to choose an appropriate
threshold b. A computational sketch is given below.
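A computational sketch of this exercise in Python, assuming the standard within-class scatter
form of Fisher's discriminant, w = S_W^{-1} (m1 - m0), and one reasonable (but not unique)
threshold choice: the midpoint of the projected class means.

    import numpy as np

    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
    c = np.array([0, 0, 0, 1])

    X0, X1 = X[c == 0], X[c == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

    # Within-class scatter matrix S_W.
    S_W = sum((x - m0)[:, None] @ (x - m0)[None, :] for x in X0) \
        + sum((x - m1)[:, None] @ (x - m1)[None, :] for x in X1)

    w = np.linalg.solve(S_W, m1 - m0)
    b = -0.5 * (w @ m0 + w @ m1)     # threshold: midpoint of projected means

    # Labels come out as -1 for class 0 and +1 for class 1.
    h = np.sign(X @ w + b)
    print("w =", w, "b =", b, "h(x) =", h)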
- Re-derive the error back-propagation algorithm using the tanh function as the squashing
function and the log probability loss function E = - log Pr(c_k = t | x).
That is, maximise the probability that the true output class is t given the input x.
The output Pr(c_k | x) is given by the "soft-max" function, which converts the real-valued
NN outputs into a well behaved probability distribution:
Pr(c_k | x) = exp(z_k) / ( \sum_k' exp(z_k') )
where the z_k are the multi-layer perceptron outputs.
Useful identities:
- d Pr(c_k' | x) / d z_k = Pr(c_k' | x) ( \delta(k,k') - Pr(c_k | x) ),
  where \delta(i,j) is 1 if i=j and 0 otherwise
- y_j = tanh(a_j)
- d tanh(a_j) / d a_j = 1 - y_j^2
- J = - log Pr(c_k | x)
- d J / d Pr(c_k | x) = - 1 / Pr(c_k | x)
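The soft-max identity can be checked numerically before you rely on it in the derivation.
A small Python sketch (my own addition) comparing the analytic soft-max Jacobian against
finite differences:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())          # shift for numerical stability
        return e / e.sum()

    z = np.array([0.5, -1.2, 2.0])
    p = softmax(z)

    # Analytic Jacobian: J[k', k] = Pr(c_k') * (delta(k,k') - Pr(c_k)).
    J_analytic = np.diag(p) - np.outer(p, p)

    # Finite-difference Jacobian, column by column.
    eps = 1e-6
    J_numeric = np.zeros((3, 3))
    for k in range(3):
        dz = np.zeros(3)
        dz[k] = eps
        J_numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

    print(np.max(np.abs(J_analytic - J_numeric)))   # ~1e-10: identity holds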