Bob's Favourite Papers

On this page I have collected my favourite papers and given an explanation of their key ideas and significance. I have arranged the papers under the headings of Machine Learning, Signal Processing and Miscellaneous.

Machine Learning

My work in machine learning has covered learning theory, online learning algorithms and kernel machines.

Learning Theory

Peter L. Bartlett, Philip M. Long and Robert C. Williamson, ``Fat-Shattering and the Learnability of Real-Valued Functions'' Journal of Computer and System Sciences, 52(3), 434-452, (1996).

This paper provided the first necessary and sufficient conditions for learnability of real valued functions. Previously it was known that finiteness of the Pollard Pseudo-Dimension was sufficient, but it was also known not to be necessary. They key idea of the proof of the main result was inspired by ideas of dithering in analog-to-digital converters used in audio. Fat-shattering was introduced in the machine learning community by Alon et al, but actually show up in work by Tihomirov in the early sixties. A distribution dependent version was also used by Talagrand around the time the present paper was written.

Wee Sun Lee, Peter L. Bartlett and Robert C. Williamson, ``Efficient Agnostic Learning of Neural Networks with Bounded Fan-in'' IEEE Transactions on Information Theory, 42(6), 2118-2132, (1996).

This was the first (I believe) paper to give a COLT-like proof of the efficient learnability of a certain class of neural networks. As with most COLT-style proofs, the constants are ridiculous and useless...

Wee Sun Lee, Peter L. Bartlett and Robert C. Williamson, ``The Importance of Convexity in Learning with Squared Loss'' IEEE Transactions on Information Theory 44(5), 1974-1980, 1998.

The normal bounds of learning theory provide for a convergence rate dominated by 1/sqrt{n} in the agnostic setting (wheras one can obtain 1/n in the realisable case). Somewhat amazingly, when using squared loss for regression, if the class of functions is convex, then one can obtain a faster rate. Yet another example of the general rule: geometry matters! The following paper extends the idea in a more esoteric manner and more importantly, points out that the lower bound claimed in the present paper is false.

Shahar Mendelson and Robert C. Williamson, Agnostic Learning Nonconvex Function Classes, pages 1-13, in J. Kivinen and R.H. Sloan (Eds), Computational Learning Theory 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8-10, 2002. Proceedings Lecture Notes in Artificial Intelligence 2375.

This builds on the convexity result of the previous paper and shows that what really matters is how far (in a certain special sense) ones target function is to the set of nonuniqueness points of the function class. Convex function classes have no non-uniqueness points (every function has a unique best approximation in the class). It also shows that the sort of lower bound claimed in the previous paper can not be true.
Robert C. Williamson, John Shawe-Taylor, Bernhard Schölkopf and Alex J. Smola, Sample Based Generalization Bounds, accepted subject to revision to IEEE Transactions on Information Theory, November 1999.
Shows how empirical covering numbers can be used to bound the performance of learning machines. The significan is twofold: the empirical covering numbers are easier to compute, and they can reflect the complexity of the task actually being considered (rather than an a priori bound which is inevitably worst case).

This paper (which was accepted subject to changes) will probably never be formally published. The changes requested would have been a lot of work, but none of the authors felt they would actually add much value. A shorter conference version was published (see P112 in publications list).

Erik Weyer, Iven M.Y. Mareels, and Robert C. Williamson, ``Sample Complexity of Stochastic Least Squares System Identification'', IEEE Transactions on Automatic Control, 44(7), 1370-1383 (1999)

This was the first work to study the finite sample complexity of linear system identification. Historically only asymptotic results were obtained. (The paper was submitted in 1995 and actually based in part on results in a CDC paper from 1992. Thus from time of doing the work (1992) to publication was 7 years!)

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, ``Structural Risk Minimization over Data-Dependent Hierarchies'', IEEE Transactions on Information Theory, 44(5), 1926-1940 (1998).

This paper (which to my surprise is one of my most highly cited) was the first COLT like analysis of the performance of the maximum margin algorithm. The reason for the difficulty in analysing such algorithms is that the usual machinery presumes there is a fixed class of functions from which the algorithm draws an hypothesis. In the MM case (and others), the function class actually depends on the data. Now you simply can not sweep this under the carpet: you have to use the information in the data only once. The paper introduced the notion of luckiness which is a way of thinking about how good your sample is. The following paper provides a more powerful and general framework.

Ralf Herbrich and Robert C. Williamson, Algorithmic Luckiness, Journal of Machine Learning Research 3, 175-212 (2002).

The traditional methods of statistical learning theory analyse learning algorithms only in terms of the class of functions from which the algorithm draws its hypotheses. The luckiness framework (previous paper) introduced a method by which performance on the sample received could be captured, but it still essentially used the device of considering the class of functions from which the (data-dependent) hypotheses were drawn. In contrast, the present paper, for the first time, allowed the use of the traditional machinery of statistical learning theory in a manner that allowed the use of properties of the algorithm beyond merely the class of functions from which the hypothesis is drawn. The paper formally subsumes the previous luckiness framework, and in addition allows the analysis of sparsity and algorithmic stability. Finally, the new machinery is applied to the analysis of the maximum margin algorithm to obtaion a new performance bound which interestingly combines three key factors: margin, sparsity, and degree of clustering of the data.

Kim L. Blackmore, Robert C. Williamson and Iven M.Y. M areels, ``Decision Region Approximation'', IEEE Transactions on Information Theory, 43(3), 903-907, 1997.

Most work in Learning Theory focusses on the sample error in the learning process. One also has to deal with the approximation error. The present paper is one of the very few that deals with the approximation by the levels sets of a function, which is what is usually done for classification learning.

Online Learning Algortithms

Online learning algorithms are perhaps the most widely used since they can deal with arbitrarily large amounts of data and can be used in real time applications. The first three papers below focus on the use for nonlinearly paramterized classes of functions. The last two papers focus on the geometry of online learning.

Kim L. Blackmore, Iven M.Y. Mareels, and Robert C. Williamson, ``Learning Nonlinearly Parametrized Decision Regions'' Summary in Journal of Mathematical Systems, Estimation and Control , 6(1), 129-132 (1996). Full version published electronically.

This paper presents a careful mathematical analysis of the behaviour of stochastic gradient descent for classification learning when the function class is nonlinearly parametrised.

This is a good example of the problems of picking the right journal! The paper was submitted in 1993 but was not published until 1996. And the journal became defunct in 1998...

Kim L. Blackmore, Robert C. Williamson and Iven M.Y. Mareels, ``Local Minima and Attractors at Infinity of Gradient Descent Learning Algorithms,'' Summary in Journal of Mathematical Systems, Estimation and Control, 6(2), pages 231-234, (1996). Full version was published electronically.

Many things can go wrong with online learning when the functions are nonlinearly parametrised. The algorithms can get "stuck" in local minima. But they can also fail even when they are not stuck by heading towards an attractor at infinity.
Kim L. Blackmore, Robert C. Williamson, Iven M.Y. Mareels, William A. Sethares, ``On-line Learning via Congregational Gradient Descent'', Mathematics of Control, Signals and Systems, 10(4), 331-363, 1997.
Genetic Algorithms were often held up to be superior to gradient descent because they did not get stuck in a local minima. This comparison always struck me as unfair as the gradient descent algorithms only maintain a single hypothesis. A fairer comparison is with a variant of gradient descent which maintains several candidate hypotheses and regularly culls the weak ones. The intricacy of the analysis of the present paper arises from the use of stochastic gradient descent algorithms and the subtelty of figuring out when to cull.

A followup paper was prepared but never published. One result in it that might be of broader interest is in section 7 and concerns the density of learning problems.

Robert E. Mahony and Robert C. Williamson, "Prior Knowledge and Preferential Structures in Learning Algorithms," Journal of Machine Learning Research, 1, 311-355, 2001. (see the final version on the JMLR web page

Studies the geometry of prior knowledge for online learning algorithms. Shows how one can define a prior for stochastic gradient descent. One can derive the Exponentiated Gradient descent algorithm from this framework. The framework is being extended; see P174 in publication list.

Kernel Machines

Bernhard Schölkopf, Alex Smola, Robert Williamson and Peter Bartlett, ``New Support Vector Algorithms,'' Neural Computation 12(5), 1207-1245, May 2000.

This paper introduced the nu-trick (hence "new"; geddit?) which is a nicer parameterization to use for SVMs. The nu parametrisation is now widely used, and this paper highly cited (117 according to Google scholar in March 2005)

Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Robert C. Williamson and Alex J.Smola, Estimating the Support of a High-Dimensional Distribution,, Microsoft technical report MSR-TR-99-87. Slightly abridged version in Neural Computation 13(7), 1443-1471, 2001.

Showed how to use kernel machines for novelty detection or one-class classification. Previous algorithms in the statistical literature are all limited to low dimensional input spaces. By utilising the nu-trick (see previous paper) a very nice knob to control the percentage of outliers is provided.

Robert C. Williamson, Alex Smola and Bernhard Schölkpof, ``Generalization Performance of Regularization Networks and Support Vector Machines via Entropy Numbers of Compact Operators,'' IEEE Transactions on Information Theory, 47(6), 2516-2532, 2001.

This paper showed (in a COLT setting) how the kernel of a SVM affects its generalisation performance. The paper used the machinery of entropy numbers of linear operators. A follow-up paper provides a more explicit calculation of the bounds which is easier to interpret.

Signal Processing

Brian C. Lovell and Robert C. Williamson, The Statistical Performance of Some Instantaneous Frequency Estimators IEEE Transactions on Signal Processing 40, pages 1708-1723, (July 1992).

Brian C. Lovell, Robert C. Williamson and Boaulem Boashash, The Relationship Between Instantaneous Frequency and Time Frequency Representations IEEE Transactions on Signal Processing , 41, pages 1458-1461 (1993).

These two papers were written whilst I was a student along with Brian Lovell who was also a student at the time. They show that the complex and intricate schemes for estimating instantaneous frequency using the Wigner-Ville distribution are, when done properly, nothing more than filtered finite differences of the phase of the analytic signal. "Doing it properly" simply means using the right formula for the moments of a random variable defined on a circle.

Darren Ward, Rodney A. Kennedy and Robert C. Williamson, "The Theory of Broadband Sensor Arrays with Frequency Invariant Far-Field Beam Patterns" Journal of the Acoustical Society of America 97(2), 1023-1034, (February 1995)

Shows how to design a beamformer (e.g. a microphone array) with a beampattern that is invariant with frequency (over wide frequency ranges).

Simon I. Hill and Robert C. Williamson, Convergence of Exponentiated Gradient Algorithms,IEEE Transactions on Signal Procesing, 49(6), 1208-1215, June 2001.

This paper analysed the Exponentiated Gradient Descent Algorithm, familiar in the Machine Learning community, using the standard machinery of signal processing.

Paul D. Teal, Robert C. Williamson and Rodney A. Kennedy, "Error Performance of a Channel of Known Impulse Response", Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on , Volume: 5 , 2000 Page(s): 2733 -2736.

Classical text books on communication theory seperately consider the error performance of a channel and its equalisation. It turns out for a fixed Signal to Noise Ratio, some channels are much harder than others. This has significant implications for mobile communications when the channel is rapidly varying.

Biljana Radlovic, Robert C. Williamson and Rodney A. Kennedy, "Equalization in an Acoustic Reverberant Environment: Robustness Results," IEEE Transactions on Speech and Audio Processing, 8(3), 311-319, May 2000.

Reverberation is a standard problem in audio transduction. (Humans remove reverberation remarkably well; See why. The present paper shows that linear equalisation is intrinsically non-robust. Even supposing you perfectly removed the reverberation by equalisation, if the source just moves a few centimetres, you will be worse off than if you had not equalised at all. Interestingly, the cocktail party effect (more specifically demixing convolutive mixtures) is a harder problem than equalisation, so it too will be very non-robust, but that does not stop lots of people writing papers anbd doing experiments on simulated data!
Richard K. Martin, William A. Sethares, and Robert C. Williamson, "Exploiting Sparsity in Adaptive Filters," IEEE transactions on Signal Processing, 50(8), 1883-1893, August 2002.
By exploting and simplifying work with Mahony on prior knowledge in online learning we showed how to tune adaptive filtering algorithms to exploit sparsity of the target in a very simple but effective way.

Darren B. Ward, Eric A. Lehmann and Robert C. Williamson, Particle Filtering Algorithms for Tracking an Acoustic Source in a Reverberant Environment, IEEE Transactions on Speech and Audio Processing, 11(6), 826-836, November 2003.

Shows how to use particle filtering combined with beamforming to track an acoustic source in a real reverberant room. As far as I am aware this was the first published work deminstrating reliable tracking in a typical reverberant office.


Robert C. Williamson and Tom Downs, ``Probabilistic Arithmetic: Numerical Methods for Calculating Convolutions and Dependency Bounds,'' International Journal of Approximate Reasoning, 4(2), pages 89-158 (1990).

The main contribution of my PhD thesis and my only paper that filled an entire issue of a journal! It presented numerical algorithms for computing bounds on the probability distribuition of elementary functions of random variables when only their marginal distributions are known.

Algorithms developed are now available in commercial software from RAMAS Software. (See Risk Calc, Probability Bounds Analysis) and seem widely cited.

I do not have an electronic version of the paper, but you can look at my thesis (without pictures). (Here is a scanned copy of my thesis with pictures (12M)) Note that I never published part 2 of the paper (bad idea to call a paper part 1 without a definite plan to make part 2!). Chapter 4 of my thesis (connection with other ideas) was what I had in mind for part 2...

Robert C. Williamson, ``The Law of Large Numbers for Fuzzy Variables under a General Triangular Norm Extension Principle,'' Fuzzy Sets and Systems, 41, 55-81, (1991).

A paper arising from my PhD thesis. Apart from the technical contribution (which incidentally contains an error), there is a remark, which I believe was novel at the time, explaining how the calculus of operations on fuzzy numbers can strictly be interpreted in the theory of probability. The key is that one needs to consider the marginals only, and compute a resultant distribution which is the worst case over all possible joint distributions. It is very often said that fuzzy numbers are different to random variables because the manner by which they are combined is so different. The truth of the matter is that one can fully recover fuzzy sets from probability (!). A (very long!) argument along these lines comprises section 4.5 of my PhD thesis.

I do not have an electronic version of the paper, but you can look at my thesis (without pictures) which contains everything in the paper. Here is a scanned copy of my thesis with pictures (12M)