Research
My principal research interests lie in machine learning, with a particular emphasis on Bayesian and kernel learning methods. I am most interested in theoretical issues and algorithms with a direct impact on the practical application of machine learning techniques, including topics such as feature selection, model selection, performance estimation, model comparison, covariate shift, dealing with imbalanced or "non-standard" data and semi-supervised learning. Most of my applied work centres on problems arising in computational biology, in collaboration with the School of Chemical Sciences and Pharmacy (CAP) and with the nearby John Innes Centre (JIC) and Institute of Food Research (IFR). However, I also have long-standing research links with the School of Environmental Sciences (ENV) and the Climatic Research Unit (CRU), working on applications of machine learning in the environmental sciences, particularly on modelling and exploiting predictive uncertainty.
Projects
- Bayesian Gene Selection for Cancer
Classification
- Drawing an Elephant with Four Parameters
- Identification of Disordered Proteins
- Bayesian Regulatory Element Detection (BRED)
- Modelling the Growth Domain of Microbial Pathogens
- Predictive Uncertainty in Environmental Modelling
- Interpretation of Archaeological Survey Data
Collaborators
- Nicola Talbot (i.e. Mrs Cawley)
- Mark Girolami
- Guido Sanguinetti
- Gareth Janacek
- Steven Hayward
- Geoff Moore
- Barry-John Theobald
- Mike Bevan
- Gary Barker
- Mike Peck
- Steve Dorling
- David Bescoby
- Neil Chroston
Grants
- MRC discipline-hopping award, grant 67192 - "Predicting protein
disorder with advanced machine learning tools", September 2004 -
August 2005.
- BBSRC grant 83/EGM16128 - "Computational approaches to identifying
gene regulatory systems in Arabidopsis", January 2002 - January
2006.
- BBSRC grant D17534 - "Kernel methods for growth domain modelling and
experimental design", April 2002 - December 2003.
- Royal Society grant 22270 - "Large-scale kernel methods", April 2001 - December 2001.
Prizes
- Joint winner of the ICANN-2009 environmental toxicity prediction
challenge.
(homepage, results)
- WCCI-2008 causation and prediction challenge, winner on SIDO and MARTI
problems. (www)
- WCCI-2008 Ford classification challenge, winner on dataset B. (www)
- IJCNN-2007 agnostic learning vs. prior knowledge challenge, best paper.
I also provided some of the baseline submissions for this challenge,
and for the related NIPS-2006 model selection game, which went unbeaten
("interim all prior" and "the bad" respectively).
(www)
- Joint winner of the WCCI-2006 performance prediction challenge, outright winner on tasks HIVA and NOVA and best paper. (www)
Selected Publications
- G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and
subsequent selection bias in performance evaluation, Journal of
Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
(www,
pdf)
- G. C. Cawley and G. J. Janacek, On allometric equations for predicting
body mass of dinosaurs, Journal of Zoology, vol. 280, no. 4,
pp. 355-361, April 2010.
(doi, pdf, supplementary material)
- G. C. Cawley and N. L. C. Talbot, Efficient approximate leave-one-out
cross-validation for kernel logistic regression, Machine
Learning, vol. 71, no. 2-3, pp. 243-264, June 2008.
(doi, pdf)
- G. C. Cawley and N. L. C. Talbot, Preventing over-fitting during model
selection via Bayesian regularisation of the hyper-parameters,
Journal of Machine Learning Research, vol. 8, pp. 841-861,
April 2007.
(www, pdf, software)
Leave-one-out cross-validation provides a convenient means of model selection for a wide range of kernel learning methods. However, the relatively high variance of the leave-one-out estimator means that over-fitting can occur in model selection as well as in training, especially if there are a large number of hyper-parameters to be tuned. In this paper, we add a regularisation term penalising large values of the hyper-parameters, which we hope will encourage comparatively smooth, simple models. The additional regularisation parameters are integrated out analytically, so the regularised model selection criterion is not significantly more expensive to evaluate. The experimental evaluation gives performance competitive with the Expectation-Propagation based Gaussian Process classifier, at a substantially reduced cost. This work was first presented at the NIPS Multi-level Inference Workshop, Whistler, B.C., Canada, December 9, 2006.
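A rough sketch of the underlying idea (for illustration only, not the algorithm from the paper): the snippet below tunes the kernel width and ridge parameter of a kernel ridge regression model by minimising the closed-form leave-one-out (PRESS) criterion plus a quadratic penalty on the log hyper-parameters. The penalty weight nu, the Nelder-Mead optimiser and the toy data are assumptions made purely for this example.

import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Z, log_width):
    """RBF kernel with the width parameterised on the log scale."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * np.exp(2.0 * log_width)))

def penalised_press(theta, X, y, nu=0.01):
    """Sum of squared LOO residuals (PRESS) for kernel ridge regression,
    plus a quadratic penalty on the log hyper-parameters (nu is assumed)."""
    log_width, log_ridge = theta
    n = len(y)
    K = rbf_kernel(X, X, log_width)
    # hat matrix of kernel ridge regression: H = K (K + lambda I)^{-1}
    H = K @ np.linalg.inv(K + np.exp(log_ridge) * np.eye(n))
    loo_residuals = (y - H @ y) / (1.0 - np.diag(H))   # closed-form LOO residuals
    return np.sum(loo_residuals ** 2) + nu * np.sum(np.asarray(theta) ** 2)

# toy regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

result = minimize(penalised_press, x0=np.zeros(2), args=(X, y), method="Nelder-Mead")
print("selected log kernel width and log ridge parameter:", result.x)

Without the penalty term the optimiser is free to drive the hyper-parameters to extreme values that merely fit the noise in the leave-one-out estimate; the penalty plays the role of the regularisation described above, although here its weight is simply fixed rather than integrated out.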
- G. C. Cawley and N. L. C. Talbot, Gene selection in cancer
classification using sparse logistic regression with Bayesian
regularisation, Bioinformatics, vol. 22, no. 19,
pp. 2348-2355, October 2006.
(doi, software)
Cancer classification based on micro-array data is one of the classic applications of machine learning techniques in computational biology. The aim is to identify biomarker genes, the expression of which is diagnostic of a particular form of cancer. In this paper, we extend the sparse logistic regression algorithm of Shevade and Keerthi (doi) to provide Bayesian regularisation, where the regularisation parameter controlling sparsity is integrated out analytically using an uninformative Jeffreys prior. This not only obviates the need for costly cross-validation based model selection procedures, but also entirely eliminates the possibility of selection bias - a common pitfall in this application (e.g. doi).
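For readers unfamiliar with embedded gene selection, the sketch below illustrates the basic idea using plain L1-penalised logistic regression on synthetic data; unlike BLogReg it relies on a fixed regularisation parameter C (an assumption made for the example) rather than integrating the regularisation parameter out under a Jeffreys prior.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic "micro-array" data: many genes, only a few of them informative
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)

# the L1 penalty induces sparsity; C is fixed here purely for illustration,
# whereas BLogReg integrates the regularisation parameter out analytically
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_.ravel())   # "genes" with non-zero weights
print(f"{selected.size} of {X.shape[1]} genes selected:", selected)

The genes retained are simply those with non-zero weights in the fitted sparse model; in the Bayesian-regularised version no value of C needs to be tuned by cross-validation, which is what avoids the selection bias discussed above.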
- G. C. Cawley, N. L. C. Talbot and M. Girolami, Sparse Multinomial
Logistic Regression via Bayesian L1 Regularisation, In Advances
in Neural Information Processing Systems 19, B. Schölkopf,
J. C. Platt and T. Hofmann (eds), MIT Press, Cambridge, MA, USA, 2007.
(pdf, software)
This paper describes the generalisation of sparse logistic regression with Bayesian regularisation (BLogReg) to the multinomial case, with a thorough experimental evaluation on a wide range of benchmark datasets. Again, the method is competitive, in terms of generalisation performance, with cross-validation based model selection, but is substantially faster.
- G. C. Cawley, Leave-One-Out Cross-Validation Based Model Selection
Criteria for Weighted LS-SVMs, In Proceedings of the International
Joint Conference on Neural Networks (IJCNN-2006), pp. 1661-1668,
Vancouver, BC, Canada, July 16-21 2006.
(pdf)
I was a joint winner (with Roman Lutz) of the WCCI-2006 Model Selection and Performance Prediction Challenge (website), winning two of the five categories (HIVA and NOVA) outright. This paper (which was also awarded the best paper prize) describes the winning method, which uses the least-squares support vector machine with an efficient model selection procedure based on virtual leave-one-out cross-validation. This method also formed the basis for the reference solutions for the NIPS/IJCNN Agnostic Learning vs. Prior Knowledge competition (website).
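As a rough illustration of the virtual leave-one-out idea just described (a sketch under simplifying assumptions, not the competition code), the snippet below trains an LS-SVM classifier with +/-1 targets and computes its leave-one-out residuals in closed form from the inverse of the full system matrix; the RBF kernel width, the regularisation parameter gamma and the toy data are fixed arbitrarily for illustration.

import numpy as np

def rbf_kernel(X, Z, width=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def lssvm_loo_residuals(K, y, gamma=1.0):
    """Closed-form leave-one-out residuals for an LS-SVM:
    r_i = alpha_i / (C^{-1})_{ii}, where C is the full system matrix."""
    n = len(y)
    C = np.zeros((n + 1, n + 1))
    C[:n, :n] = K + np.eye(n) / gamma   # kernel block plus ridge term
    C[:n, n] = 1.0                      # bias column
    C[n, :n] = 1.0                      # bias row
    Cinv = np.linalg.inv(C)
    alpha = (Cinv @ np.concatenate([y, [0.0]]))[:n]   # dual coefficients
    return alpha / np.diag(Cinv)[:n]

# toy binary problem with +/-1 labels
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)

r = lssvm_loo_residuals(rbf_kernel(X, X), y)
loo_predictions = y - r                       # leave-one-out output for each pattern
print("estimated LOO error rate:", np.mean(y * loo_predictions < 0.0))

In practice the hyper-parameters would then be chosen by minimising a criterion built from these residuals, which is what makes the model selection procedure cheap: the leave-one-out estimate is obtained without explicitly retraining the model n times.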
- G. C. Cawley and N. L. C. Talbot, Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recognition, vol. 36, no. 11, pp. 2585-2592, November 2003. (pdf, doi)
A full list of my publications (many of which are downloadable) can be found here.
Research Links
Personal
- Nicky, my wife.
- Cameron, my son.
- Colney Cricket Club
- The SYS Cricket Club home page
- Norfolk Cricket League
- CricInfo UK
- RiscOs Software
- A self modifying C program
Research Team: Gavin Cawley, Nicola Talbot, Kamal Saadi.