Research
My principal research interests lie in machine learning, with a particular emphasis on Bayesian and kernel learning methods. I am most interested in theoretical issues and algorithms with a direct impact on the practical application of machine learning techniques, including topics such as feature selection, model selection, performance estimation, model comparison, covariate shift, dealing with imbalanced or "non-standard" data and semi-supervised learning. Most of my applied work centres on problems arising in computational biology, in collaboration with the School of Chemical Sciences and Pharmacy (CAP) and with the nearby John Innes Centre (JIC) and Institute of Food Research (IFR). However, I also have long-standing research links with the School of Environmental Sciences (ENV) and the Climatic Research Unit (CRU), working on applications of machine learning in the environmental sciences, particularly on modelling and exploiting predictive uncertainty.
Projects
- Bayesian Gene Selection for Cancer
Classification
- Drawing an Elephant with Four Parameters
- Identification of Disordered Proteins
- Bayesian Regulatory Element Detection (BRED)
- Modelling the Growth Domain of Microbial Pathogens
- Predictive Uncertainty in Environmental Modelling
- Interpretation of Archaeological Survey Data
Collaborators
- Nicola Talbot (i.e. Mrs Cawley)
- Mark Girolami
- Guido Sanguinetti
- Gareth Janacek
- Steven Hayward
- Geoff Moore
- Barry-John Theobald
- Mike Bevan
- Gary Barker
- Mike Peck
- Steve Dorling
- David Bescoby
- Neil Chroston
Grants
- MRC discipline-hopping award, grant 67192 - "Predicting protein
disorder with advanced machine learning tools", September 2004 -
August 2005.
- BBSRC grant 83/EGM16128 - "Computational approaches to identifying
gene regulatory systems in Arabidopsis", January 2002 - January
2006.
- BBSRC grant D17534 - "Kernel methods for growth domain modelling and
experimental design", April 2002 - December 2003.
- Royal Society grant 22270 - "Large-scale kernel methods", April 2001 - December 2001.
Prizes
- Joint winner of the ICANN-2009 environmental toxicity prediction
challenge.
(homepage, results)
- WCCI-2008 causation and prediction challenge, winner on SIDO and MARTI
problems. (www)
- WCCI-2008 Ford classification challenge, winner on dataset B. (www)
- IJCNN-2007 agnostic learning vs. prior knowledge challenge, best paper.
I also provided some of the baseline submissions for this challenge,
and for the related NIPS-2006 model selection game, which went unbeaten
("interim all prior" and "the bad" respectively).
(www)
- Joint winner of the WCCI-2006 performance prediction challenge, outright winner on tasks HIVA and NOVA and best paper. (www)
Selected Publications
- G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and
subsequent selection bias in performance evaluation, Journal of
Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
(www,
pdf)
- G. C. Cawley and G. J. Janacek, On allometric equations for predicting
body mass of dinosaurs, Journal of Zoology, vol. 280, no. 4,
pp. 355-361, April 2010.
(doi, pdf, supplementary material)
- G. C. Cawley and N. L. C. Talbot, Efficient approximate leave-one-out
cross-validation for kernel logistic regression, Machine
Learning, vol. 71, no. 2-3, pp. 243-264, June 2008.
(doi, pdf)
- G. C. Cawley and N. L. C. Talbot, Preventing over-fitting during model
selection via Bayesian regularisation of the hyper-parameters,
Journal of Machine Learning Research, vol. 8, pp. 841-861,
April 2007.
(www, pdf, software)
Leave-one-out cross-validation provides a convenient means of model selection for a wide range of kernel learning methods. However, the relatively high variance of the leave-one-out estimator means that over-fitting can occur in model selection as well as in training, especially if there are a large number of hyper-parameters to be tuned. In this paper, we add a regularisation term penalising large values of the hyper-parameters, which we hope will encourage comparatively smooth, simple models. The additional regularisation parameters are integrated out analytically, so the regularised model selection criterion is not significantly more expensive to evaluate. The experimental evaluation gives performance competitive with the Expectation-Propagation based Gaussian Process classifier, at a substantially reduced cost. This work was first presented at the NIPS Multi-level Inference Workshop, Whistler, B.C., Canada, December 9, 2006.
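A rough sketch of the underlying idea (for illustration only, not the algorithm from the paper): the snippet below tunes the kernel width and ridge parameter of a kernel ridge regression model by minimising the closed-form leave-one-out (PRESS) criterion plus a quadratic penalty on the log hyper-parameters. The penalty weight nu, the Nelder-Mead optimiser and the toy data are assumptions made purely for this example.

import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Z, log_width):
    """RBF kernel with the width parameterised on the log scale."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * np.exp(2.0 * log_width)))

def penalised_press(theta, X, y, nu=0.01):
    """Sum of squared LOO residuals (PRESS) for kernel ridge regression,
    plus a quadratic penalty on the log hyper-parameters (nu is assumed)."""
    log_width, log_ridge = theta
    n = len(y)
    K = rbf_kernel(X, X, log_width)
    # hat matrix of kernel ridge regression: H = K (K + lambda I)^{-1}
    H = K @ np.linalg.inv(K + np.exp(log_ridge) * np.eye(n))
    loo_residuals = (y - H @ y) / (1.0 - np.diag(H))   # closed-form LOO residuals
    return np.sum(loo_residuals ** 2) + nu * np.sum(np.asarray(theta) ** 2)

# toy regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

result = minimize(penalised_press, x0=np.zeros(2), args=(X, y), method="Nelder-Mead")
print("selected log kernel width and log ridge parameter:", result.x)

Without the penalty term the optimiser is free to drive the hyper-parameters to extreme values that merely fit the noise in the leave-one-out estimate; the penalty plays the role of the regularisation described above, although here its weight is simply fixed rather than integrated out.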
- G. C. Cawley and N. L. C. Talbot, Gene selection in cancer
classification using sparse logistic regression with Bayesian
regularisation, Bioinformatics, vol. 22, no. 19,
pp. 2348-2355, October 2006.
(doi, software)
Cancer classification based on micro-array data is one of the classic applications of machine learning techniques in computational biology. The aim is to identify biomarker genes, the expression of which is diagnostic of a particular form of cancer. In this paper, we extend the sparse logistic regression algorithm of Shevade and Keerthi (doi) to provide Bayesian regularisation, where the regularisation parameter controlling sparsity is integrated out analytically using an uninformative Jeffreys prior. This not only obviates the need for costly cross-validation based model selection procedures, but also entirely eliminates the possibility of selection bias - a common pitfall in this application (e.g. doi).
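For readers unfamiliar with embedded gene selection, the sketch below illustrates the basic idea using plain L1-penalised logistic regression on synthetic data; unlike BLogReg it relies on a fixed regularisation parameter C (an assumption made for the example) rather than integrating the regularisation parameter out under a Jeffreys prior.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic "micro-array" data: many genes, only a few of them informative
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)

# the L1 penalty induces sparsity; C is fixed here purely for illustration,
# whereas BLogReg integrates the regularisation parameter out analytically
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_.ravel())   # "genes" with non-zero weights
print(f"{selected.size} of {X.shape[1]} genes selected:", selected)

The genes retained are simply those with non-zero weights in the fitted sparse model; in the Bayesian-regularised version no value of C needs to be tuned by cross-validation, which is what avoids the selection bias discussed above.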
- G. C. Cawley, N. L. C. Talbot and M. Girolami, Sparse Multinomial
Logistic Regression via Bayesian L1 Regularisation, In Advances
in Neural Information Processing Systems 19, B. Schölkopf,
J. C. Platt and T. Hofmann (eds), MIT Press, Cambridge, MA, USA, 2007.
(pdf, software)
This paper describes the generalisation of sparse logistic regression with Bayesian regularisation (BLogReg) to the multinomial case, with a thorough experimental evaluation on a wide range of benchmark datasets. Again, the method is competitive, in terms of generalisation performance, with cross-validation based model selection, but is substantially faster.
- G. C. Cawley, Leave-One-Out Cross-Validation Based Model Selection
Criteria for Weighted LS-SVMs, In Proceedings of the International
Joint Conference on Neural Networks (IJCNN-2006), pp. 1661-1668,
Vancouver, BC, Canada, July 16-21 2006.
(pdf)
I was a joint winner (with Roman Lutz) of the WCCI-2006 Model Selection and Performance Prediction Challenge (website), winning two of the five categories (HIVA and NOVA) outright. This paper (which was also awarded the best paper prize) describes the winning method, which uses the least-squares support vector machine with an efficient model selection procedure based on virtual leave-one-out cross-validation. This method also formed the basis for the reference solutions for the NIPS/IJCNN Agnostic Learning vs. Prior Knowledge competition (website).
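As a rough illustration of the virtual leave-one-out idea just described (a sketch under simplifying assumptions, not the competition code), the snippet below trains an LS-SVM classifier with +/-1 targets and computes its leave-one-out residuals in closed form from the inverse of the full system matrix; the RBF kernel width, the regularisation parameter gamma and the toy data are fixed arbitrarily for illustration.

import numpy as np

def rbf_kernel(X, Z, width=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def lssvm_loo_residuals(K, y, gamma=1.0):
    """Closed-form leave-one-out residuals for an LS-SVM:
    r_i = alpha_i / (C^{-1})_{ii}, where C is the full system matrix."""
    n = len(y)
    C = np.zeros((n + 1, n + 1))
    C[:n, :n] = K + np.eye(n) / gamma   # kernel block plus ridge term
    C[:n, n] = 1.0                      # bias column
    C[n, :n] = 1.0                      # bias row
    Cinv = np.linalg.inv(C)
    alpha = (Cinv @ np.concatenate([y, [0.0]]))[:n]   # dual coefficients
    return alpha / np.diag(Cinv)[:n]

# toy binary problem with +/-1 labels
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)

r = lssvm_loo_residuals(rbf_kernel(X, X), y)
loo_predictions = y - r                       # leave-one-out output for each pattern
print("estimated LOO error rate:", np.mean(y * loo_predictions < 0.0))

In practice the hyper-parameters would then be chosen by minimising a criterion built from these residuals, which is what makes the model selection procedure cheap: the leave-one-out estimate is obtained without explicitly retraining the model n times.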
- G. C. Cawley and N. L. C. Talbot, Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recognition, vol. 36, no. 11, pp. 2585-2592, November 2003. (pdf, doi)
A full list of my publications (many of which are downloadable) can be found here.
Research Links
Personal
- Nicky, my wife.
- Cameron, my son.
- Colney Cricket Club
- The SYS Cricket Club home page
- Norfolk Cricket League
- CricInfo UK
- RiscOs Software
- A self modifying C program
Research Team: Gavin Cawley, Nicola Talbot, Kamal Saadi.