(Left) Visualization of human/mouse codon substitution rates shows distinctive structure in protein-coding regions, reflecting evolutionary pressure to preserve amino acid properties; no such structure is present in noncoding regions (lower panel). (Middle) Schematic of a sequence alignment of a portion of human chromosome 17 shows three likely protein-coding exons, based on sequence conservation and codon substitution patterns. (Right) Two of these exons (B, D) are already known, while (C) is not present in the RefSeq or Ensembl databases. However, mRNA and EST experimental data confirm that (C) is a novel alternatively spliced exon.
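To make the idea of a codon substitution pattern concrete, the following is a minimal sketch, not the method used in this work. It assumes an ungapped pairwise alignment and the standard genetic code, and it only tallies whether each aligned codon pair is identical, synonymous (same amino acid), or nonsynonymous in a given reading frame. The function name `codon_substitution_summary` and the two short sequences are invented for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_substitution_summary(seq_a, seq_b, frame=0):
    """Tally aligned codon pairs as identical, synonymous, or nonsynonymous.

    Hypothetical helper: assumes seq_a and seq_b are aligned position by
    position. Codon pairs containing gaps or ambiguous bases are skipped.
    """
    counts = {"identical": 0, "synonymous": 0, "nonsynonymous": 0}
    usable = min(len(seq_a), len(seq_b))
    for i in range(frame, usable - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base in this codon pair
        if codon_a == codon_b:
            counts["identical"] += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts


# Toy aligned sequences (made up, not real genomic data).
human = "ATGGCTAAAGGT"
mouse = "ATGGCCAAGGAT"
for frame in range(3):
    print(frame, codon_substitution_summary(human, mouse, frame))
```

In a true coding region, one reading frame would be expected to show relatively many synonymous changes compared with the other frames, whereas noncoding sequence would show no such frame preference. A realistic analysis would weight each specific codon pair by empirically estimated substitution rates rather than use this three-way tally.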
Human: Using alignments with the dog, mouse, and rat genomes, we have identified nearly 3,000 novel protein-coding exons in the human genome that are supported by strong comparative evidence. These are likely to correspond to previously unknown alternatively spliced exons as well as potentially hundreds of entirely novel genes. A sample of these was tested and confirmed using RT-PCR. Our analysis also casts doubt on hundreds of current gene annotations, including a small number in the well-annotated ENCODE regions. In addition to supporting or rejecting entire genes, our approach suggests many corrections to existing gene annotations, such as adding or removing individual exons.
Fruit fly: In D. melanogaster, an important model organism, we have likewise used alignments with up to 8 other Drosophila species to identify more than 500 novel exons, both within existing gene annotations and in entirely novel regions of the genome. Our analysis also suggests that at least 500 presently annotated genes are not real; most of these are single-exon annotations that have only marginal support from experimental data. Our analysis is contributing to improvements in the official fly genome annotation through an ongoing collaboration with FlyBase.
Fungi: We are using our methods to refine the gene annotations for C. albicans, a common human pathogen, and to assist in the initial annotation of several newly sequenced Candida species, including C. tropicalis, C. guilliermondii, and C. lusitaniae. In the yeast S. cerevisiae, we use alignments with other Saccharomyces species to analyze short genes (25-100 amino acids) that have previously been overlooked by comparative and other methods.
[1] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94, 1997.
[2] I. Korf, P. Flicek, D. Duan, and M.R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics 17 Suppl 1:S140-148, 2001.
[3] G. Parra, P. Agarwal, J.F. Abril, T. Wiehe, J.W. Fickett, and R. Guigo. Comparative gene prediction in human and mouse. Genome Res. 13:108-117, 2003.
[4] A. Siepel and D. Haussler. Computational identification of evolutionarily conserved exons. Proc. 8th 177-186, 2004.
[5] S.S. Gross and M.R. Brent. Using multiple alignments to improve gene prediction. Proc. 8th 375-388, 2005.