Introduction
Identifying and understanding transcriptional regulatory networks and their interactions is a major challenge in biology. We use a Relevance Vector Machine (RVM) [2] to classify gene expression according to the composition of promoter sequences. The RVM uses a Bayesian Automatic Relevance Determination (ARD) prior to select a small subset of promoter motifs for its discriminatory rule, so as to distinguish as well as possible between regulated genes. The RVM also has the useful property that the user need not set any parameters or make any statistical assumptions, such as a threshold of significance for a feature, since the entire model is generated automatically from the data. It also considers the significance of a feature in the context of the features already selected. This makes the approach especially suitable for biological problems with many variables of unknown significance that may influence each other.
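As an illustration of what "composition of promoter sequences" could look like as classifier input, here is a minimal sketch that counts short DNA words (k-mers) in each promoter. The choice of k-mer counts, the function name `kmer_features` and the toy sequences are all assumptions made for this example; the motif representation actually used by the tool is not described here.

```python
from itertools import product

def kmer_features(promoters, k=3):
    """Count occurrences of every DNA k-mer in each promoter sequence.

    Returns the list of k-mers (column labels) and one row of counts
    per promoter. Overlapping occurrences are counted.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    rows = []
    for seq in promoters:
        seq = seq.upper()
        counts = [0] * len(kmers)
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:  # skip windows containing N or other symbols
                counts[j] += 1
        rows.append(counts)
    return kmers, rows

# Toy, made-up sequences (not real Arabidopsis promoters).
promoters = ["ACGTACGT", "TTTTACGN"]
kmers, X = kmer_features(promoters, k=3)
```

Each row of `X` could then serve as the input feature vector for one gene, with the up- or down-regulated label as the target.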
Sparse Bayesian Learning and The Relevance Vector Machine
In a statistical pattern recognition setting, the Relevance Vector Machine (RVM) [2] is essentially a Bayesian training algorithm for logistic regression models with input feature selection via Bayesian Automatic Relevance Determination. The RVM has a number of advantages over more conventional feature selection algorithms. Firstly, there are no hyper-parameters that must be tuned in order to maximise generalisation performance, for instance an arbitrary input feature salience threshold. Secondly, the algorithm is not greedy, in the sense that a feature picked early in the training procedure may be dropped later if a more compact, more informative combination of features is discovered. Lastly, the algorithm is based on sound Bayesian principles (it is a type-II maximum-likelihood procedure), rather than being merely a heuristic. The RVM implementation used here is based on the fast marginal likelihood maximisation algorithm of Tipping and Faul [4]. An older implementation of the RVM, based on the methods presented in [2], is available from www.relevancevector.com.
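To make the idea of ARD-based feature selection concrete, the following is a minimal sketch of sparse Bayesian logistic regression. It follows the hyper-parameter re-estimation scheme described in [2] (a Laplace approximation followed by an update of one precision per feature), not the faster sequential algorithm of [4] on which the tool is based, and it is not the tool's own code. The function name `fit_rvm_classifier`, the pruning threshold, the iteration counts and the synthetic data are assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rvm_classifier(Phi, t, n_outer=200, n_newton=25, prune_at=1e6):
    """Sparse Bayesian logistic regression with one ARD precision per feature.

    Phi is an (n_samples, n_features) array, t holds 0/1 labels.
    Returns the indices of the retained features and their weights.
    """
    n_features = Phi.shape[1]
    active = np.arange(n_features)   # features still in the model
    alpha = np.ones(n_features)      # ARD precisions (hyper-parameters)
    w = np.zeros(n_features)         # most probable weights
    for _ in range(n_outer):
        if active.size == 0:
            break
        P = Phi[:, active]
        # Laplace approximation: Newton steps towards the posterior mode.
        for _ in range(n_newton):
            y = sigmoid(P @ w)
            grad = P.T @ (t - y) - alpha * w
            H = P.T @ (P * (y * (1.0 - y))[:, None]) + np.diag(alpha)
            step = np.linalg.solve(H, grad)
            w = w + step
            if np.max(np.abs(step)) < 1e-8:
                break
        # Posterior covariance evaluated at the mode.
        y = sigmoid(P @ w)
        H = P.T @ (P * (y * (1.0 - y))[:, None]) + np.diag(alpha)
        Sigma = np.linalg.inv(H)
        # Re-estimate each precision: gamma_i measures how well the data
        # determine weight i; alpha_i = gamma_i / w_i^2.
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / np.maximum(w ** 2, 1e-300)
        # Features whose precision diverges are switched off.
        keep = alpha < prune_at
        active, alpha, w = active[keep], alpha[keep], w[keep]
    return active, w

# Synthetic data: only feature 0 carries information about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
t = (rng.random(200) < sigmoid(3.0 * X[:, 0])).astype(float)

active, w = fit_rvm_classifier(X, t)
print("retained features:", active)
print("weights:", w)
```

No salience threshold is applied to the features themselves: a feature leaves the model only when its precision grows without bound, which forces its weight to zero. The `prune_at` value is a numerical stand-in for "infinite" rather than a tuning parameter in the usual sense.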
Online Bayesian Regulatory Element Detection (BRED) Tool
To perform an analysis of the promoters of a set of co-regulated genes in Arabidopsis, simply enter the gene IDs (e.g. At1g01020, At2g269930 or At5g67620) of the up- and down-regulated genes, the title of the experiment, your name and your email address in the boxes below and press "submit". A report on the results of the promoter analysis will be emailed to you in due course. A full list of permissible gene IDs is available in genes.txt. At the moment, the tool is limited to around 100 genes in each category, apparently because of a limitation in the MATLAB Webserver; we are currently working on a solution.
Performing the analysis is likely to take 30-40 minutes or so, largely depending on the total number of up- and down-regulated genes. If I were you, I'd go off and make myself a nice cup of tea and then have a look at cricinfo, rather than just sit at my computer patiently waiting for it.
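Before submitting, it may help to check the gene lists locally. The sketch below assumes that gene IDs follow the usual Arabidopsis locus pattern ("At", a chromosome code, "g", then five digits) and treats the "around 100" limit as a hard cap of 100; both are assumptions for illustration, and the names `AGI_PATTERN`, `MAX_GENES_PER_CATEGORY` and `check_gene_list` are hypothetical. A well-formed ID is not necessarily a permissible one: genes.txt remains the authoritative list.

```python
import re

# Assumed locus pattern: "At", chromosome (1-5, C or M), "g", five digits.
AGI_PATTERN = re.compile(r"^At[1-5CM]g\d{5}$", re.IGNORECASE)

# Hypothetical cap mirroring the "around 100 genes" limit of the tool.
MAX_GENES_PER_CATEGORY = 100

def check_gene_list(gene_ids, limit=MAX_GENES_PER_CATEGORY):
    """Split IDs into well-formed and malformed ones; flag oversize lists."""
    cleaned = [g.strip() for g in gene_ids if g.strip()]
    valid = [g for g in cleaned if AGI_PATTERN.match(g)]
    invalid = [g for g in cleaned if not AGI_PATTERN.match(g)]
    return {"valid": valid, "invalid": invalid, "too_many": len(valid) > limit}

# Example: the last ID is deliberately one digit short.
up = check_gene_list(["At1g01020", "At5g67620", "At1g0102"])
print(up)
```

The same check would be run separately on the up-regulated and down-regulated lists, since the limit applies to each category.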
Acknowledgements
This work was supported by grants from the U.K. Biotechnology and Biological Sciences Research Council (BBSRC) under the Exploiting Genomics initiative (grant numbers 83/EGM16126 and 83/EGM16128, "Computational approaches to identifying gene regulatory systems in Arabidopsis").
References
N.B. DOI represents a link to on-line material via the Digital Object Identifier
(DOI) system, where available. WWW represents a link
to on-line materials available via the publisher's web-site.
[1] Bussemaker, H. J., Li, H. and Siggia, E. D., "Regulatory element detection
using correlation with expression", Nature Genetics, vol. 27,
pp. 167-171, 2001. (DOI)
[2] Tipping, M. E., "Sparse Bayesian learning and the Relevance Vector
Machine", Journal of Machine Learning Research, vol. 1, pp. 211-244,
June 2001. (WWW)
[3] Li, Y., Lee, K. K., Smith, C., Walsh, C., Sorefan, K., Cawley, G. and Bevan, M., "Establishing Glucose- and ABA-regulated Transcription Networks in
Arabidopsis by Microarray Analysis and Promoter Classification using a
Relevance Vector Machine", Genome Research (submitted), 2005.
[4] Tipping, M. E. and Faul, A. C., "Fast marginal likelihood maximisation for
sparse Bayesian models", in C. M. Bishop and B. J. Frey (Eds.), Proceedings of
the Ninth International Workshop on Artificial Intelligence and Statistics,
Key West, FL, Jan 3-6 2003. (WWW)
Research Team: Kamel Saadi, Gary Lee, Gavin Cawley, Yunhai Li (JIC), Mike Bevan (JIC).