MIT CSAIL Research Abstracts

Research Abstracts - 2007

Topic Models for Gene Expression Analysis

John Barnett & Tommi Jaakkola

Problem

Gene expression levels, as measured in microarray and related experiments, are the result of the combined effect of multiple underlying factors. On the one hand, the biological samples contain a large population of cells, where different subpopulations (such as different tissue types) may themselves exhibit different expression levels. On the other hand, multiple processes of activation and repression are acting across the genome. A recent large-scale gene expression experiment, the Connectivity Map [5], serves to highlight the relationship between cell types, cellular processes, and gene expression: biologically active small molecules (think: drugs) are used to treat a variety of cell lines, and the response in gene expression is measured. Disentangling these interactions is the goal of our present work.

Motivation

Gene expression profiling provides a window into the inner workings of the cell, as exhibited through messenger RNA levels. Drugs impact different pathways within the cell, and the active pathways may differ between cell types. Understanding more precisely which pathways are affected by drugs in specific cell types opens up new possibilities for more targeted drugs or combination drug therapies. For instance, a drug that influences multiple pathways, some of them undesirably, might have its side effects mitigated by another drug whose effects overlap.

Approach

We propose to adapt topic models, along the lines of Latent Dirichlet Allocation (LDA) [3] and Hierarchical Dirichlet Processes (HDP) [7], together with related matrix factorization approaches such as Convex Matrix Factorization (CMF) [2] and Principal Components Analysis (PCA) (see [6] for a microarray application), to the task of inferring, from gene expression data, the pathways affected by drugs in different cell types. LDA and HDP assume that documents (each represented as a vector of word counts) arise from a mixture of topics, whose number is fixed in LDA and variable in HDP; each topic makes a different subset of words prominent. Similarly, matrix factorization methods can be viewed as topic models for continuous data, and have previously been applied to the analysis of gene expression data [1, 2, 4, 6]. In these methods the vector of expression values across the genome is viewed as a linear combination of the underlying factors; PCA and CMF both additionally assume that the noise follows a Gaussian distribution. We are developing a topic model that combines aspects of both classes of models: by discretizing the data, we can treat the discrete counts of genes in a manner analogous to words in a document. At the same time, we assume that the topic selection is governed by a log-linear response to the input variables, namely drug and cell type. Our assumption is that topics correspond (loosely) to pathways, which are active to varying degrees in different cell types, and which are affected differently by various drugs. The combination of active pathways is then reflected in gene expression, and our goal is to infer the pathways, which drugs affect them, and how this varies with cell type.
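The generative view described above can be illustrated with a minimal sketch. It is not the model under development: the additive form of the log-linear scores (one weight vector per drug plus one per cell type), the names drug_A, cell_X and so on, and all numeric values are assumptions chosen only for illustration. The sketch turns a (drug, cell type) pair into topic proportions through a softmax, then draws discretized gene counts by first picking a topic (pathway) and then a gene from that topic.

```python
import math
import random

# Hypothetical toy parameters: 2 topics ("pathways") and 4 genes.
DRUG_WEIGHTS = {"drug_A": [2.0, 0.0], "drug_B": [0.0, 2.0]}
CELL_WEIGHTS = {"cell_X": [0.5, 0.0], "cell_Y": [0.0, 0.5]}
# Each row is one topic's probability distribution over the 4 genes.
TOPIC_GENE_PROBS = [[0.7, 0.2, 0.1, 0.0],
                    [0.0, 0.1, 0.2, 0.7]]

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topic_proportions(drug, cell, drug_weights, cell_weights):
    """Log-linear topic selection (assumed additive): score_k = w_drug[k] + w_cell[k]."""
    scores = [d + c for d, c in zip(drug_weights[drug], cell_weights[cell])]
    return softmax(scores)

def sample_gene_counts(theta, topic_gene_probs, n_draws, rng):
    """Draw n_draws gene tokens: pick a topic from theta, then a gene from that topic."""
    n_genes = len(topic_gene_probs[0])
    counts = [0] * n_genes
    for _ in range(n_draws):
        k = rng.choices(range(len(theta)), weights=theta)[0]
        g = rng.choices(range(n_genes), weights=topic_gene_probs[k])[0]
        counts[g] += 1
    return counts

theta = topic_proportions("drug_A", "cell_X", DRUG_WEIGHTS, CELL_WEIGHTS)
counts = sample_gene_counts(theta, TOPIC_GENE_PROBS, 200, random.Random(0))
print(theta, counts)
```

Inference would run in the opposite direction, estimating the weights and the topic-to-gene distributions from observed counts; that step is not shown here.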

We are also developing an alternative model in which the data are represented solely through their relative ranking. This is precisely the information preserved under an arbitrary choice of monotonic normalization function, so the model side-steps the question of how the data are normalized for comparison. To do so, we aim to define a parameterized family of distributions over orderings such that the inferred parameters are easily interpretable and the distribution is tractable and provides control over complexity. Our initial attempts are based on summarizing orderings through partial orderings; however, while a partial order is easily interpretable, there are tractability issues associated with counting the number of orderings that are consistent with a given partial order. We can encode this as the problem of counting perfect matchings in a graph, which is #P-complete, but the actual complexity of our counting problem is not known.
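Two points in the paragraph above can be made concrete with a small sketch, offered as an illustration rather than as the approach being developed. The function ranks shows that a rank representation is unchanged by a strictly increasing normalization, and count_linear_extensions counts the total orderings consistent with a partial order. The counting routine is a generic dynamic program over subsets, chosen here for simplicity; it is exact but takes time and memory exponential in the number of items, which is the tractability issue in miniature. Both function names and the pair-based encoding of the partial order are assumptions made for this example.

```python
from functools import lru_cache

def ranks(values):
    """Return the rank of each entry (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def count_linear_extensions(n, relations):
    """Count total orderings of items 0..n-1 consistent with a partial order.

    relations is an iterable of (a, b) pairs meaning "a must precede b".
    Dynamic programming over subsets: up to 2**n states, so only small n is feasible.
    """
    preds = [0] * n  # preds[b] is a bitmask of items that must precede b
    for a, b in relations:
        preds[b] |= 1 << a
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def ways(placed):
        # placed is a bitmask of items already put into the ordering
        if placed == full:
            return 1
        total = 0
        for item in range(n):
            bit = 1 << item
            unplaced = (placed & bit) == 0
            ready = (preds[item] & ~placed) == 0  # all predecessors placed
            if unplaced and ready:
                total += ways(placed | bit)
        return total

    return ways(0)

expression = [0.1, 5.0, 2.0]
# Cubing is strictly increasing, so the ranks are identical.
print(ranks(expression) == ranks([v ** 3 for v in expression]))
# Three items with the single constraint "0 before 1".
print(count_linear_extensions(3, [(0, 1)]))
```

With three items and the single constraint that item 0 precedes item 1, half of the 3! = 6 orderings qualify, so the count is 3.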

Support

The authors acknowledge support from NSF ITR grant 0428715.

References

[1] Orly Alter, Patrick O. Brown, and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proc Nat Acad Sci, 97(18):10101–10106, 2000.

[2] John D. Barnett. Convex matrix factorization for gene expression analysis. Master's thesis, MIT, 2003.

[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14, 2001.

[4] Neal S. Holter, Madhusmita Mitra, Amos Maritan, Marek Cieplak, Jayanth R. Banavar, and Nina V. Fedoroff. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc Nat Acad Sci, 97(15):8409–8414, 2000.

[5] J. Lamb, E.D. Crawford, D. Peck, J.W. Modell, I.C. Blat, M.J. Wrobel, J. Lerner, J.P. Brunet, A. Subramanian, K.N. Ross, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795):1929, 2006.

[6] S. Raychaudhuri, J. Stuart, and R. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. In Pacific Symposium on Biocomputing, vol. 5, 2000.

[7] Yee-Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, 2004.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu