Regularized Boosting for BioinformaticsLior Wolf & Ian S. MartinMotivationBoosting techniques, i.e. the iterative combination of weak classifiers to build a strong classifier, are extremely popular in many domains where one wishes to learn effective classification functions and identify the most relevant variables at the same time. However, in general, these techniques do not perform well on bioinformatics data sets, since they typically contain very few training examples, especially considering the large number of variables. Most boosting algorithms tend to overfit on such data sets. Our goal is to present variants of boosting techniques that work better in these settings. Another popular algorithm, the Support Vector Machine, works well on all data sets, regardless of size, but fails to select the important features for use in designing biological markers. Related WorkModifications to the most popular boosting algorithms, AdaBoost[8] and GentleBoost[5], have been suggested in the past, namely by Long and Vega in relation to boosting and microarray data[7]. Following previous work [4] that showed that boosting is not well-suited for expression data, they developed several deterministic modifications of AdaBoost, and showed that these variants perform much better than the traditional AdaBoost on microarray data. In order to compare their results to ours, we implemented their two most successful variants: VC and NR. In AdaBoost-VC (the initials VC imply that the form of regularization is similar to the one derived using VC bounds on generalization error), the empirical weighted error of the classifier chosen at round t is adjusted prior to reweighting. The other variant, AdaBoost-NR ("No Repeat"), has two modifications over the usual AdaBoost algorithm. First, each gene (variable) is constrained for use in at most one decision stump. Second, a decision stump with an empirical error of zero, is weighted in the final classifier as if it has an empirical error of 0.1/m, where m is the number of examples. We focus on boosting over decision stumps[1] and GentleBoost, so we implemented two variants of GentleBoost that were inspired by Long and Vega. The simpler one is GentleBoost-NR in which, like AdaBoost-NR, only one stump per gene is included; the VC version of GentleBoost was adapted to accommodate the basic differences between AdaBoost and GentleBoost. For example, AdaBoost uses the global error estimate to update the weights of all the correctly classified training examples, whereas GentleBoost updates each example's weight individually by considering the error of its best base classifier. Noise Injection was a technique used by researchers in the mid-1990s to improve the generalization abilities of any classifier[3]. Noise injection creates multiple copies of the training examples, each perturbed by zero-mean, low-variance Gaussian noise. The motivation is that if two data points are close, we would like them to be classified identically. By introducing many examples with similar values and identical labels, we teach the classifier to have better stability properties. Hence, the learned function is encouraged to be smooth (at least around the training points). The study of the noise injection technique established the following results on noise injection: (1) It is an effective way to reduce generalization error. (2) It has a similar effect on shrinkage (the statistical term for regularization) of the parameters in some simple models. (3) It is equivalent to Tikhonov regularization[2]. Note that this does not mean that we can always use Tikhonov regularization instead of noise injection, as for some learning algorithms it is not possible to create a regularized version. ApproachThe technique we introduce is similar in spirit to noise injection. However, it is different enough that the results obtained for noise injection will not hold for it. Like noise injection, our technique is based on a stochastic regularization method, but it is inherently different from deterministic regularization. For example, the results of[2] use a Taylor expansion around the original data points. Such an approximation will not hold for our new technique, since the noise is too large (i.e., the new datapoint is too different). Other important properties that might not hold are the independence of noise across coordinates, and the zero mean of the noise. Our regularization technique is based on creating corrupted copies of the dataset. Each new data point is a copy of one original training point, picked at random, where one random coordinate (feature) is replaced with a different value (usually the value of the same coordinate in another random training example). We call it the Stochastic Regularization (SR) procedure. It is repeated many times to create new examples and can be used with any learning algorithm. The SR technique is especially suited for use when learning from only a few examples. The robustness we demand from the selected classification function is much more than local smoothness around the classification points (c.f. noise injection). This kind of smoothness is easy to achieve when example points are far from one another. Our regularization, however, is less restrictive than demanding uniform smoothness (Tikhonov) or requiring the reduction of as many parameters as possible. Both of these approaches might not be ideal when only a few examples are available because there is nothing to balance a large amount of uniform smoothness, and it is easy to fit a model that uses very few parameters. Instead, we encourage redundancy in the classifier since, in contrast to the shortage of training examples, there is an abundance of features. ResultsWe compared the performance of our technique on biological data sets including gene expression (microarray), NMR spectroscopy, and high-throughput docking (HTD) to show that our technique is comparable and even superior to previously suggested modifications of AdaBoost. The GentleBoost-SR algorithm performed well for all three types of data. In HTD classification, one tries to predict the activity level of compounds against a protein target of interest in an attempt to identify novel compounds that elicit a desired biological response. A typical data set contains millions of compounds, and the number of features (describing the chemical and geometrical properties of the compounds) is in the thousands. The initial experiments we present are on a data set recommended to us by[6]. In general, our results indicate that: (1) The SR procedure helps GentleBoost, raising it to the same level of performance as SVM. (2) SR seems to help AdaBoost as well, but not always. (3) As expected SR produces classifiers that tend to use more features. (4) SR shows different, mostly better, performance than noise injection (GentleBoost-NI). ImpactNR and SR variants seem to perform better than other GentleBoost variants through encouraging redundancy in the resulting classifier. Where traditional boosting algorithms tend to overfit, GentleBoost-SR works well regardless of the amount of training examples. The VC variants of both AdaBoost and GentleBoost did not perform as well on the datasets considered. Our experimental evidence also shows that SR is difficult to immitate in a deterministic manner. Noise injection was used to regularize unstable algorithms a decade ago; in its new form (SR) it can be used to prevent the overfitting of state-of-the-art algorithms. It enables these classifiers to be used on bioinformatics data sets that are noisy and contain very few examples. AcknowledgmentsThis report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Sciences & Artificial Intelligence Laboratory (CSAIL).This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1. Additional support was provided by: Central Research Institute of Electric Power Industry (CRIEPI), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., Industrial Technology Research Institute (ITRI), Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, NEC Fund, Oxygen, Siemens Corporate Research, Inc., Sony, Sumitomo Metal Industries, and Toyota Motor Corporation. References[1] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schlummer, and Z. Yakhini. Tissue classification with gene expression profiles. Journal of Computational Biology, Vol. 7, pp. 559--584, 2000. [2] C.M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, Vol. 7, pp. 108, 1995. [3] L. Breiman. Heuristics of instability and stabilization in model selection. Ann. Statist., Vol. 24, pp. 2350--2383, 1996. [4] S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, Vol. 97, pp. 77--87, 2002. [5] J. Friedman, Trevor Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Ann. Statist., Vol. 28, pp. 337--407, 2000. [6] Anthony E. Klon, Meir Glick and John W. Davies. Application of Machine Learning to Improve the Results of High-Throughput Docking Against the HIV-1 Protease. J. Chem Inf. Comput. Sci., Vol. 44, pp. 2216-2224, 2004. [7] P.M. Long and V. B. Vega. Boosting and microarray data. Machine Learning, Vol. 52, pp. 31--44, 2003. [8] Robert Schapire. A breif introduction to boosting. IJCAI, pp. 1401--1406, 1999. |
||
|