Semiparametric Convex Matrix Factorization for ClassificationJohn Barnett & Tommi JaakkolaGene expression data as measured using microarrays are population level observations, as on the order of 10^{6} cells are required to provide enough mRNA for measurement. We propose two variations on convex matrix factorization (CMF, [1][2]) specifically for classification from population means, which we apply to classification of cancer from gene expression. Given a matrix of gene expression data, X, where each row is a microarray experiment, and each column a gene, we can approximate X by a matrix product U*V, to minimize the residual squared error. With no constraints on U or V, this is solved using singular value decomposition (SVD), and is closely related to principal components analysis (PCA). However, we consider constraints on U and V appropriate to the gene expression problem: interpreting the components as cell types, U can be interpreted as the proportion of cell types in each sample, so that rows are nonnegative and sum to one (hence convex); V can be understood as the RNA level for each (gene, cell type) combination, and is thus nonnegative. We use raw, that is, not logtransformed, measurements since the mixing occurs at the level of RNA counts. With these constraints, the error minimization problem may have local optima, but an approximate solution can be found through iterative optimization. CMF alone, however, is not tailored to the problem of classification. Thus, we have developed two extensions to CMF specifically targeted at classification, one inspired by the support vector machine (SVM) and the other a nonparametric Bayesian model akin to hierarchical Dirichlet processes. One of the core ideas of the support vector machine is that of maximizing the margin [3], that is, the distance from points to the decision boundary. We introduce a linear classifier into the CMF problem, and augment the original optimization objective with a regularized, weighted margin maximization term. In the limit as the weight goes to zero, the problem is equivalent to first solving CMF, and subsequently using the cell type distribution in an SVM. In the second approach, we develop a Bayesian nonparametric generative model, initially paralleling CMF. In a probabilistic setting, CMF can be viewed as describing the data using using a Gaussian distribution, with unknown (constrained) parameters U and V. By providing prior distributions for U and V, we can cast this in a Bayesian setting. In particular, since each row of U is a distribution, the Dirichlet is a possible choice for a prior; and gene expression levels have been observed to roughly follow an exponential distribution. To adapt this for classification, we incorporate variables for the class labels, and adopt classspecific priors for U. Finally, we borrow ideas from hierarchical Dirichlet processes [4], used for classification of grouped data, to develop a nonparametric version which no longer requires the specification of the number of cell types in the model. References[1]Daniel D. Lee & H. Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788791, 1999. [2]John D. Barnett. Convex Matrix Factorization for Gene Expression Analysis. Master's Thesis, MIT, 2003. [3]Nello Cristianini & John ShaweTaylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. [4]YeeWhye Teh, Michael I. Jordan, Matthew J. Beal & David M. Blei. Hierarchical Dirichlet Processes. In NIPS, 2004. 

