CSAIL Research Abstracts - 2005

Prediction of Tissue Specific Alternatively Spliced Exons

Neha Soni, Gene Yeo, Tomaso Poggio & Christopher B. Burge


Alternative splicing plays a major role in generating protein diversity and regulating gene expression in higher eukaryotes[1], and more than half of all human genes are known to be alternatively spliced.  Splicing is regulated by the interaction of protein factors with the splicing machinery; these factors bind to elements in the regulated exons or in the flanking introns.  Alternative splicing is widely believed to be of particular importance in the nervous system, and defects in the splicing machinery are known to cause a substantial fraction of human genetic diseases[3].  Even so, very little is known about the precise mechanism behind alternative splicing.  In most cases it is not clear what causes exons to be differentially skipped in different tissues, and only a very small percentage of all alternatively spliced exons have been discovered and confirmed.  We aim to use microarray data for exon-exon junctions in different tissues, together with machine learning and statistical techniques, to predict and discover new alternatively spliced exons and to find splicing patterns across tissues.

Previous Work

Alternative splicing has been studied using Expressed Sequence Tags (ESTs) as well as sequence data.  ACEScan[2], a recently developed exon classification algorithm, uses sequence features to distinguish evolutionarily conserved alternatively spliced exons (ACEs, alternative conserved exons) from other orthologous human/mouse exons.  EST-based studies have significant limitations because of bias in transcript coverage and non-uniformity of tissue libraries or sampling[3].  Reverse transcriptase polymerase chain reaction (RT-PCR) provides more direct evidence that an exon is alternatively spliced, but it is labor-intensive and is therefore best suited to analysing only a few genes in a small number of tissues.

Recent experiments have shown that microarray or fiber-optic array probes can also monitor splicing events.  Probes are positioned at the exon-exon junctions of each gene, and hybridization is measured across 52 tissue samples.  Johnson et al.[5] analyzed the hybridization intensity data to identify tissue-specific differences.  They built a background model of the expected intensities that assumed the presence of all exons in the gene, and subtracted this model from the normalized observed intensities.  The junctions that deviated most from the expected hybridization values were labeled as possibly skipped, and some of these predictions were validated using RT-PCR.
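The background-subtraction idea can be illustrated with a small sketch. It is not the model of Johnson et al.[5]: the per-junction median across tissues is an assumed stand-in for the expected "all exons present" intensity, and the function name, tissue names and toy values are hypothetical.

```python
from statistics import median

def rank_deviations(log_intensity):
    """Rank (tissue, junction) pairs by deviation from an expected value.

    log_intensity maps a tissue name to a list of normalized log intensities,
    one per junction probe of a single gene. ASSUMPTION: the median across
    tissues approximates the intensity expected when every exon is present.
    Returns (residual, tissue, junction) tuples, most negative residual first.
    """
    tissues = list(log_intensity)
    n_junctions = len(log_intensity[tissues[0]])
    expected = [median(log_intensity[t][j] for t in tissues)
                for j in range(n_junctions)]
    residuals = [(log_intensity[t][j] - expected[j], t, j)
                 for t in tissues for j in range(n_junctions)]
    return sorted(residuals)

# Toy data: junctions 1 and 2 are depleted in "brain" only.
toy = {
    "brain": [0.0, -2.0, -2.5, 0.1],
    "liver": [0.1, 0.0, 0.0, 0.0],
    "lung": [0.0, 0.1, -0.1, 0.0],
}
ranked = rank_deviations(toy)
for residual, tissue, junction in ranked[:2]:
    print(tissue, junction, round(residual, 2))
```

In this toy case the two junctions with the largest negative deviation both belong to the same tissue and are adjacent, which is the pattern expected when the exon between them is skipped.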

Prediction of Alternative Splicing Events

We are trying to discover new alternatively spliced exons within human genes with the help of microarray data.  Microarray data tends to be very noisy, and this particular data set has many confounding variables.  The gene expression levels in the various tissues are not known to us.  In addition, probes within a gene differ widely in binding strength.  Probes can also bind to other parts of the gene in addition to the junction they represent, which distorts the hybridization intensities measured on the chip.  We hope to obtain newer microarray data with additional probes, which may improve the prediction rate for alternatively spliced exons.

We took into account the correlation between adjacent junction probes in the event of exon skipping.  This correlation was not explicitly modeled by Johnson et al.[5].  We also wanted to predict exons that have no junction probe because they were not previously known to exist, and that therefore lie within what is annotated as an intron.  Because gene content differs between tissues and binding affinity differs between probes, pre-normalization of the hybridization intensities turned out to be a very important first step.  Given the nature of the microarray intensity data, it made sense to approximate the gene abundance in each tissue by the geometric mean of the intensities of all junction probes of the gene in that tissue.  Each junction intensity was then normalized by this approximation.  Similarly, variation in hybridization intensity due to defects and differences in the probes was reduced by normalizing by the approximated probe binding affinity.  Using clustering approaches, we assigned a score (the netta score) to each exon in the gene, reflecting the confidence with which it is skipped in the group of 52 tissues that we are looking at.
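The two normalization steps can be sketched as follows. The geometric-mean abundance estimate follows the text; the probe affinity estimate (geometric mean of each probe across tissues after the first step), the junction indexing, and the `flanking_evidence` helper are assumptions made for illustration. The helper is not the netta score; it only shows why adjacent junction probes are informative together.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalize_gene(intensities):
    """Normalize one gene's junction intensities.

    intensities[t][j] is the raw intensity of junction probe j in tissue t.
    Step 1 divides each tissue row by its geometric mean (abundance proxy).
    Step 2 divides each probe column by its geometric mean across tissues,
    an ASSUMED proxy for probe binding affinity.
    """
    by_abundance = []
    for row in intensities:
        abundance = geometric_mean(row)
        by_abundance.append([v / abundance for v in row])
    n_probes = len(intensities[0])
    affinity = [geometric_mean([row[j] for row in by_abundance])
                for j in range(n_probes)]
    return [[row[j] / affinity[j] for j in range(n_probes)]
            for row in by_abundance]

def flanking_evidence(normalized_row, exon):
    """Joint-drop evidence for an internal exon in one tissue.

    ASSUMED indexing: junction j joins exon j and exon j + 1, so exon e is
    flanked by junctions e - 1 and e. Taking the larger of the two log2
    values gives a strongly negative result only when BOTH flanking
    junctions are depleted. Hypothetical score, not the netta score.
    """
    upstream = math.log2(normalized_row[exon - 1])
    downstream = math.log2(normalized_row[exon])
    return max(upstream, downstream)

raw = [
    [100.0, 200.0, 400.0],  # tissue A
    [10.0, 20.0, 40.0],     # tissue B: same pattern, 10x lower abundance
    [100.0, 50.0, 100.0],   # tissue C: junctions 1 and 2 both reduced
]
norm = normalize_gene(raw)
for name, row in zip("ABC", norm):
    print(name, [round(v, 3) for v in row], round(flanking_evidence(row, 2), 3))
```

Tissues A and B differ only in overall abundance, so they become identical after the first step. With only three junctions the depleted probes also pull down the abundance estimate for tissue C, which is one reason real data with more probes per gene should behave better than this toy case.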

We also want to study the correlation between tissues in their splicing patterns for various genes.  The close resemblance of microarray intensity values to pixel values in an image suggests using a clustering approach to search for spliced exons.  We treat the hybridization profile of a gene in a particular tissue as a vector whose dimensions are the different exon-exon junctions.  We separate the tissues into clusters according to similarities in the way exons are spliced.  The difficulty with this approach is that it is not known in advance how many tissue clusters, if any, can be found for a given gene.  We will use nonnegative matrix factorization[6] to identify distinct splicing patterns and to predict the classes, following the approach used to recover information from cancer-related microarray data[7].  This will help us determine whether there are blocks of tissues that behave similarly in the way alternative splicing occurs in different genes.  We are comparing our netta score predictions with EST-based predictions of alternatively spliced exons, and of exons found within introns, from work done in the Burge lab; the resulting predictions will be verified using RT-PCR.  We will use 8 different tissues including brain, fetal brain, liver, lung, kidney, testis and bone marrow.
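A minimal sketch of clustering tissues with nonnegative matrix factorization is given below, using the multiplicative update rules of Lee and Seung[6] for squared error. The rank `k`, the toy matrix, the rescaling of basis rows to unit sum, and the rule of assigning each tissue to its largest coefficient are assumptions made for illustration; choosing `k` when the number of clusters is unknown is not addressed here.

```python
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def nmf(v, k, iterations=1000, seed=0, eps=1e-9):
    """Approximate nonnegative v (tissues x junctions) as w times h.

    w is tissues x k and h is k x junctions, both nonnegative. Uses the
    Lee-Seung multiplicative updates for squared Euclidean error.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    # Rescale each basis row of h to unit sum (and w to compensate) so that
    # the coefficients in w are comparable across components.
    scales = [sum(row) for row in h]
    h = [[x / s for x in row] for row, s in zip(h, scales)]
    w = [[x * s for x, s in zip(row, scales)] for row in w]
    return w, h

def assign_clusters(w):
    """Assign each tissue to the component with the largest coefficient."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]

# Toy gene with 4 junctions: two tissues deplete junction 2, two deplete junction 1.
v = [
    [1.0, 1.0, 0.1, 1.0],
    [2.0, 2.0, 0.2, 2.0],
    [1.0, 0.1, 1.0, 1.0],
    [0.5, 0.05, 0.5, 0.5],
]
w, h = nmf(v, k=2)
labels = assign_clusters(w)
print(labels)
```

Because the factors are nonnegative, each row of `h` can be read as a splicing pattern over junctions and each row of `w` as how strongly a tissue expresses each pattern. The component numbering depends on the random initialization, so only the grouping of tissues is meaningful.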

In the ACEScan approach[2], the set of human ESTs and cDNAs that can be reliably aligned to a human gene locus overlapping a particular exon was considered, and this set was subdivided into transcripts that include the exon in question and transcripts that skip it.  The same subdivision was made for the mouse transcripts that align to exons of the orthologous mouse gene.  Each human/mouse exon pair was then assigned to one of four categories depending on whether the exon was observed to be skipped only in human transcripts, only in mouse, in both or in none.  Alternative inclusion and exclusion of exons is known to be affected by factors such as intron and exon length, pre-mRNA secondary structure, splice site strength, and the presence of exonic and intronic splicing enhancers and silencers.  We thus used the set of exons with high netta scores and the set with low netta scores to try to distinguish feature motifs that enhance splicing events in the exon.  This will be helpful in predicting new tissue-specific alternatively spliced exons.
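The four-way categorization of human/mouse exon pairs described above can be sketched as a small function. The function name, the category labels, and the use of a simple count threshold of one skipping transcript are assumptions made for illustration; the source does not state how much transcript evidence ACEScan requires.

```python
def classify_exon_pair(human_skipping, mouse_skipping):
    """Assign an orthologous human/mouse exon pair to one of four categories.

    human_skipping and mouse_skipping are the numbers of aligned transcripts
    (ESTs or cDNAs) that skip the exon in each species. ASSUMPTION: a single
    skipping transcript counts as evidence of skipping.
    """
    in_human = human_skipping > 0
    in_mouse = mouse_skipping > 0
    if in_human and in_mouse:
        return "skipped_in_both"
    if in_human:
        return "skipped_in_human_only"
    if in_mouse:
        return "skipped_in_mouse_only"
    return "skipped_in_neither"

# Hypothetical exon pairs: (human skipping count, mouse skipping count).
pairs = {"exonA": (3, 2), "exonB": (1, 0), "exonC": (0, 4), "exonD": (0, 0)}
categories = {name: classify_exon_pair(h, m) for name, (h, m) in pairs.items()}
print(categories)
```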


This report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Science & Artificial Intelligence Laboratory (CSAIL).  This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1.  Additional support was provided by: Central Research Institute of Electric Power Industry (CRIEPI), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., Industrial Technology Research Institute (ITRI), Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, NEC Fund, Oxygen, Siemens Corporate Research, Inc., Sony, Sumitomo Metal Industries, and Toyota Motor Corporation.


[1] Yeo G, Holste D, Nostrand EV, Poggio T, Burge CB. Predictive identification of alternative splicing events conserved in human and mouse.

[2] Yeo G. Identification, Improved Modeling and Integration of Signals to Predict Constitutive and Alternative Splicing, 2004.

[3] J. Castle et al. Optimization of oligonucleotide arrays and RNA amplification protocols for analysis of transcript structure and alternative splicing. In Genome Biology, R66, 2003.

[4] B. K. Dredge, A. D. Polydorides and R. B. Darnell. The splice of life: Alternative splicing and neurological disease. In Nature Reviews Neuroscience, Vol. 2, January 2001.

[5] J. Johnson et al. Genome-Wide survey of human alternative pre-mRNA splicing with exon junction microarrays. In Science, Vol. 302, December 2003.

[6] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. In Nature, Vol. 401, October 1999.

[7] Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. In PNAS, Vol. 101, March 2004.
