MIT CSAIL Research Abstracts

Research Abstracts - 2007

Transcriptional Factor Recruitment and Gene Regulation in Mammalian Tissues

Kenzie D. MacIsaac, William Gordon, Katherine Romer & Ernest Fraenkel

Introduction

One of the most fundamental problems in biology is to understand genetic regulation. How are hundreds of types of cells encoded in a single genome? How can a static set of genetic instructions produce the dynamic behavior needed to survive in a changing environment? Until recently, insights into gene regulation were derived from a limited number of genes that had been studied in detail. It is now possible, however, to study the regulation of all the genes in the genome directly. The availability of genome-wide data creates new computational challenges in modeling. We are developing approaches that use these data to understand several levels of regulation in eukaryotic cells.

Transcription Factor Specificity

Regulatory proteins known as transcription factors bind to specific sites in the genome and determine whether adjacent genes will be activated or repressed. In some cases, the preferred binding site sequences for these proteins are known. But in many cases, these sequence patterns, known as sequence motifs, must be deduced from noisy experimental data. Motif discovery is a challenging problem for which many algorithms have been developed.
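To make the notion of a sequence motif concrete, the sketch below builds a log-odds position weight matrix from a few aligned sites and scans a sequence for its best-scoring window. This is a generic textbook representation, not a method described in this abstract; the example sites, the pseudocount, the uniform background, and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical aligned binding sites (illustrative data, not from the abstract).
SITES = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTAA"]
BASES = "ACGT"


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Return one dict of log2-odds scores per motif position."""
    matrix = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases stay finite.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log2(counts[b] / total / background) for b in BASES}
        )
    return matrix


def score_window(matrix, window):
    """Sum the per-position scores of one window (same width as the motif)."""
    return sum(col[base] for col, base in zip(matrix, window))


def best_site(matrix, sequence):
    """Scan the forward strand; return (score, offset) of the best window."""
    width = len(matrix)
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


pwm = build_log_odds(SITES)
score, offset = best_site(pwm, "CCCCTGACTCAGGGG")
print(f"best window starts at offset {offset} with score {score:.2f}")
```

A real scan would also score the reverse complement and use a background estimated from the genome of interest; both are omitted here to keep the sketch short.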

Intelligently combining the predictions of several distinct motif discovery algorithms generally gives more accurate results than relying on a single method [1]. We recently presented a general framework for implementing such a strategy, which allowed us to analyze in vivo binding data for almost all of the transcription factors in the yeast Saccharomyces cerevisiae. By combining the motif predictions of six separate algorithms, we produced the first extensive map of regulatory sites for a eukaryotic organism [2]. In later work using the same general methodology, we improved this draft regulatory code by designing motif discovery algorithms that exploit evolutionary data derived from the genomic sequences of closely related yeast species [3]. We have also made available WebMOTIFS, a set of web-based motif discovery tools that allows other researchers to apply our motif discovery strategy easily.
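One simple way to combine predictions from several programs is to keep only motifs that more than one program reports. The sketch below illustrates that idea under strong simplifying assumptions: motifs are probability matrices of equal width that are already aligned, similarity is a mean per-column Euclidean distance, and the thresholds are arbitrary. It is not the framework of [1] or [2]; the algorithm names, helper functions, and data are hypothetical.

```python
import math

BASES = "ACGT"


def from_consensus(consensus, strength=0.85):
    """Build a toy probability matrix that favors the consensus base."""
    rest = (1.0 - strength) / 3
    return [
        {b: (strength if b == c else rest) for b in BASES} for c in consensus
    ]


def motif_distance(motif_a, motif_b):
    """Mean per-column Euclidean distance (assumes equal width, no offset)."""
    total = 0.0
    for col_a, col_b in zip(motif_a, motif_b):
        total += math.sqrt(sum((col_a[b] - col_b[b]) ** 2 for b in BASES))
    return total / len(motif_a)


def supported_motifs(predictions, max_distance=0.3, min_support=2):
    """Keep motifs that at least `min_support` algorithms report.

    `predictions` maps an algorithm name to its list of motifs. Support
    counts the reporting algorithm itself plus every other algorithm that
    has at least one motif within `max_distance`.
    """
    kept = []
    for name, motifs in predictions.items():
        for motif in motifs:
            support = sum(
                any(motif_distance(motif, other) <= max_distance
                    for other in others)
                for others in predictions.values()
            )
            if support >= min_support:
                kept.append((name, motif, support))
    return kept


# Hypothetical output of three motif finders on the same bound sequences.
predictions = {
    "algo_a": [from_consensus("TGACTCA")],
    "algo_b": [from_consensus("TGAGTCA")],
    "algo_c": [from_consensus("CACGTGA")],
}
kept = supported_motifs(predictions)
print([(name, support) for name, _, support in kept])
```

In this toy input the motifs from `algo_a` and `algo_b` differ at a single position and support each other, while the unrelated motif from `algo_c` is dropped. A practical implementation would also have to handle motifs of different widths, offsets, and reverse complements.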

Motif discovery in mammalian systems is a far more challenging problem than in yeast. Most algorithms have been optimized and tested on yeast data, and their performance on sequence data for higher organisms is generally quite poor [4]. One problem is that mammalian promoters contain many competing sequence patterns that can swamp the sequence signal corresponding to the true binding site of the protein. We have developed an algorithm, called THEME, that addresses this problem by restricting the search space to motifs consistent with prior biological knowledge about the protein [5]. Transcription factors can be grouped into families based on the structural characteristics of their DNA-binding domains. We have derived a set of prototypical motifs that represent the DNA sequence specificity of members of many of the most common structural families. THEME tests each of these prototypes, first performing a constrained optimization of the motif and then using the optimized version to build a classifier that distinguishes bound from unbound examples. The best motif is identified by evaluating its classification error on held-out test data. We have demonstrated that this approach works very well for mammalian transcription factors whose DNA-binding domain family is known. Furthermore, even when the family is not known, THEME is able to identify both the correct motif and the correct family in the majority of cases by testing hypotheses from all families.
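The hypothesis-testing step described above can be sketched as follows. This is a minimal illustration, not THEME itself: it skips the constrained optimization of each prototype, uses a single score cutoff as the classifier, and evaluates on one fixed train/test split rather than cross-validation. The family names, prototype consensus strings, sequences, and function names are all hypothetical.

```python
import math

BASES = "ACGT"


def prototype(consensus, strength=0.85):
    """Log-odds matrix for a hypothetical family prototype."""
    rest = (1.0 - strength) / 3
    return [
        {b: math.log2((strength if b == c else rest) / 0.25) for b in BASES}
        for c in consensus
    ]


def best_score(matrix, seq):
    """Highest-scoring forward-strand window of the sequence."""
    width = len(matrix)
    return max(
        sum(col[base] for col, base in zip(matrix, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )


def fit_threshold(scores, labels):
    """Pick the cutoff with the fewest training errors (bound if >= cutoff)."""
    best = None
    for cutoff in sorted(set(scores)):
        errors = sum((s >= cutoff) != y for s, y in zip(scores, labels))
        if best is None or errors < best[0]:
            best = (errors, cutoff)
    return best[1]


def held_out_error(matrix, train, test):
    """Fit the cutoff on `train`, report the error rate on `test`."""
    train_scores = [best_score(matrix, seq) for seq, _ in train]
    cutoff = fit_threshold(train_scores, [label for _, label in train])
    wrong = sum((best_score(matrix, seq) >= cutoff) != label
                for seq, label in test)
    return wrong / len(test)


def select_motif(prototypes, train, test):
    """Return the prototype with the lowest held-out error, plus all errors."""
    errors = {name: held_out_error(m, train, test)
              for name, m in prototypes.items()}
    return min(errors, key=errors.get), errors


# Toy data: (sequence, was the region bound?). Bound examples carry TGACTCA.
train = [("AATGACTCATT", True), ("GGTGACTCACC", True),
         ("AACCGGTTAAC", False), ("GTGTGTGTGTG", False)]
test = [("CCTGACTCAGG", True), ("ACACACACACA", False)]
prototypes = {"bZIP-like": prototype("TGACTCA"),
              "bHLH-like": prototype("CACGTG")}

best, errors = select_motif(prototypes, train, test)
print(best, errors)
```

Because the winner is chosen by error on examples that were not used to fit the cutoff, a prototype that merely overfits the training sequences gains no advantage, which is the point of evaluating on held-out data.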

A Unified Biophysical Model

Our Bluebird algorithm extends THEME by using a unified probabilistic model founded on biophysical principles to integrate structural information, experimental binding data from multiple conditions, and sequence information. This approach allows Bluebird to detect changes in the nuclear concentration of a transcription factor between growth conditions and to identify any condition-specific changes in binding specificity. The Bluebird algorithm eliminates the need for the arbitrary cutoffs frequently used to separate bound from unbound sequences. We have demonstrated that this model can be extended to analyze binding data for multiple proteins that may bind cooperatively as partners or compete for the same binding site. Using a principled model selection procedure, Bluebird tests each mechanistic hypothesis and determines which model best explains the observed condition-specific binding behavior.
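The abstract does not give Bluebird's equations. As an illustration of what a biophysically motivated model can look like, the sketch below uses the standard two-state thermodynamic occupancy, P(bound) = 1 / (1 + exp(E - mu)), with the binding energy E and the chemical potential mu in units of kT. The additive mismatch-energy matrix, the 2 kT penalty, and the function names are assumptions made for this example.

```python
import math

BASES = "ACGT"

# Hypothetical energy matrix: 0 kT for the preferred base, 2 kT per mismatch.
ENERGY = [{b: (0.0 if b == c else 2.0) for b in BASES} for c in "TGACTCA"]


def site_energy(energy_matrix, site):
    """Additive binding energy of a site, in units of kT."""
    return sum(col[base] for col, base in zip(energy_matrix, site))


def occupancy(binding_energy, chemical_potential):
    """Two-state equilibrium probability that the site is bound.

    A higher chemical potential stands in for a higher nuclear
    concentration of the factor.
    """
    return 1.0 / (1.0 + math.exp(binding_energy - chemical_potential))


for mu in (-1.0, 1.0):
    for site in ("TGACTCA", "TGAGTCA"):
        p = occupancy(site_energy(ENERGY, site), mu)
        print(f"mu={mu:+.1f} site={site} occupancy={p:.3f}")
```

In a model of this form, occupancy is a continuous probability, so no score cutoff is needed to label sites as bound or unbound, and a change in factor concentration between conditions appears as a shift in mu that raises or lowers the occupancy of every site together.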

The Connection between Transcription Factor Binding and Gene Expression

The binding of transcription factors to DNA initiates a complicated sequence of events that can alter the expression of adjacent genes. In many cases, particular combinations of transcription factors jointly recruit additional proteins known as co-activators and co-repressors. We have begun a project to reveal the rules that determine co-activator and co-repressor recruitment. Our approach is designed to reveal the mechanisms that translate hidden genomic sequence features into diverse regulatory programs.

References:

[1] K.D. MacIsaac and E. Fraenkel. Practical strategies for discovering regulatory DNA sequence motifs. In PLoS Computational Biology, 2006; 2(4):e36.

[2] C.T. Harbison, D.B. Gordon, et al. Transcriptional regulatory code of a eukaryotic genome. In Nature, 2004 Sep. 2; 431(7004):99-104.

[3] K.D. MacIsaac, T. Wang, D.B. Gordon, D.K. Gifford, G.D. Stormo and E. Fraenkel. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. In BMC Bioinformatics, 2006; 7:113.

[4] M. Tompa et al. Assessing computational tools for the discovery of transcription factor binding sites. In Nature Biotechnology, 2005 Jan.; 23(1):137-44.

[5] K.D. MacIsaac et al. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. In Bioinformatics, 2006 Feb. 15; 22(4):423-9.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu