Introduction
Cancer classification based on micro-array data is one of the classic applications of machine learning in computational biology. The aim is to identify biomarker genes, the expression of which is diagnostic of a particular form of cancer. In this project, we extended the sparse logistic regression algorithm of Shevade and Keerthi [1] to provide Bayesian regularisation, where the regularisation parameter controlling sparsity is integrated out analytically, using an uninformative Jeffreys prior (cf. [2]). This not only obviates the need for costly cross-validation based model selection procedures but also eliminates the possibility of selection bias, a common pitfall in this application [3]. The results obtained using this approach are competitive with those obtained via cross-validation based regularisation and with the Relevance Vector Machine (RVM) [4,5] on the well-known colon cancer [7] and leukaemia [6] benchmark datasets. This page contains supplementary information for [8].
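The analytic integration mentioned above can be sketched as follows. This is a minimal outline under stated assumptions, not a reproduction of the derivation in [8]: it assumes the usual L1-penalised training criterion, a Laplace prior over the N non-zero weights, and an improper Jeffreys prior over the regularisation parameter. The symbols M, Q, E_D, E_W, N, alpha and lambda are notation chosen for this sketch.

```latex
\begin{align*}
% Assumed penalised criterion: data misfit plus L1 penalty
M &= E_{\mathcal{D}} + \lambda E_{\mathcal{W}},
  \qquad E_{\mathcal{W}} = \sum_{i=1}^{N} |\alpha_i| \\
% Laplace prior over the N active weights, given lambda
p(\boldsymbol{\alpha} \mid \lambda)
  &= \left(\frac{\lambda}{2}\right)^{N} \exp\{-\lambda E_{\mathcal{W}}\} \\
% Improper Jeffreys prior over the scale parameter lambda
p(\lambda) &\propto \frac{1}{\lambda} \\
% Integrate lambda out using the Gamma integral
p(\boldsymbol{\alpha})
  &= \int_{0}^{\infty} p(\boldsymbol{\alpha} \mid \lambda)\, p(\lambda)\, d\lambda
  \propto \frac{1}{2^{N}} \int_{0}^{\infty} \lambda^{N-1}
     e^{-\lambda E_{\mathcal{W}}}\, d\lambda
  = \frac{\Gamma(N)}{(2 E_{\mathcal{W}})^{N}} \\
% Negative log prior, dropping terms constant in alpha for fixed N
-\log p(\boldsymbol{\alpha}) &= N \log E_{\mathcal{W}} + \text{const} \\
% Resulting criterion, which no longer contains lambda
Q &= E_{\mathcal{D}} + N \log E_{\mathcal{W}}
\end{align*}
```

Under these assumptions the final criterion contains no regularisation parameter, so there is no value of lambda left to tune by cross-validation.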
Software
A MATLAB implementation of the BLogReg algorithm is made available for research purposes under the GNU General Public License (GPL). For efficiency, it is implemented as a C-language MEX file. If you download both blogreg.c and blogreg.m then, provided you have configured mex correctly, blogreg should automatically (and transparently) compile itself the first time it is executed.
- Bayesian logistic regression routine (blogreg.c)
- Online help file (blogreg.m)
- Minimal demonstration using Leukaemia dataset (demo.m)
- MAT file containing the Leukaemia dataset (leukaemia.mat)
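If the automatic compilation does not happen, a manual build can be attempted from the MATLAB prompt. The following is a minimal sketch, assuming the four files listed above have been saved to the current folder and that a supported C compiler is installed. The calling convention of blogreg is deliberately not shown, since it is documented in blogreg.m rather than on this page.

```matlab
% Select a C compiler for MEX builds (only needed once per installation).
mex -setup

% Compile the C source into a MEX file in the current folder.
mex blogreg.c

% Display the online help contained in blogreg.m.
help blogreg

% Run the minimal demonstration on the leukaemia dataset.
run('demo.m')
```

Using run('demo.m') rather than typing demo avoids any ambiguity with MATLAB's own demo command.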
Acknowledgements
This work was supported by grants from the U.K. Biotechnology and Biological Sciences Research Council (BBSRC) under the Exploiting Genomics initiative (grant numbers 83/EGM16126 and 83/EGM16128, "Computational approaches to identifying gene regulatory systems in Arabidopsis"). N.B. the BLogReg algorithm was originally developed for use in the detection of transcription factor binding sites, see e.g. [9]; this work is currently in progress.
References
N.B. doi represents a link to on-line material via the Digital Object Identifier (DOI) system, where available, while www represents a link to on-line material available via the publisher's web-site.
[1] S. K. Shevade and S. S. Keerthi, "A simple and efficient algorithm for
gene selection using sparse logistic regression", Bioinformatics,
vol. 19, no. 17, pp. 2246-2253, 2003.
(doi)
[2] W. L. Buntine and A. S. Weigend, "Bayesian backpropagation", Complex Systems, vol. 5, pp. 603-643, 1991.
[3] C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on
the basis of microarray gene-expression data", PNAS, vol. 99, pp. 6562-6566, 2002.
(doi)
[4] M. E. Tipping, "Sparse Bayesian learning and the Relevance Vector
Machine", Journal of Machine Learning Research, vol. 1, pp. 211-244,
June 2001.
(www)
[5] M. E. Tipping and A. C. Faul, "Fast marginal likelihood maximisation for
sparse Bayesian models", in C. M. Bishop and B. J. Frey (Eds.), Proceedings of
the Ninth International Workshop on Artificial Intelligence and Statistics,
Key West, FL, Jan 3-6 2003.
(www)
[6] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, vol. 286, no. 5439, pp. 531-537, 15 October 1999.
(doi)
[7] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack and A. J.
Levine, "Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissues probed by oligonucleotide arrays", PNAS,
vol. 96, no. 12, pp. 6745-6750, June 1999.
(www)
[8] G. C. Cawley and N. L. C. Talbot, "Gene selection in cancer classification
using Sparse Logistic Regression with Bayesian Regularisation", Bioinformatics,
vol. 22, no. 19, pp. 2348-2355, October 2006.
(doi, pdf)
[9] Y. Li, K. K. Lee, S. Walsh, C. Smith, S. Hadingham, K. Sorefan, G. Cawley
and M. W. Bevan, "Establishing glucose- and ABA-regulated transcription
networks in Arabidopsis by microarray analysis and promoter classification
using a Relevance Vector Machine", Genome Research, vol. 16, no. 3,
pp. 414-427, March 2006.
(doi)
Research Team: Gavin Cawley, Nicola Talbot.