CSAIL Publications and Digital Archive header
bullet Research Abstracts Home bullet CSAIL Digital Archive bullet Research Activities bullet CSAIL Home bullet

link to publications.csail.mit.edu link to www.csail.mit.edu horizontal line


Research Abstracts - 2007
horizontal line

horizontal line

vertical line
vertical line

Predicting Protein and RNA Structures and Interactions using Biophysically-Motivated Energy Functions and Machine Learning

Charles W. O'Donnell, Mieszko Lis, Jérôme Waldispühl & Srinivas Devadas

This project focuses on two related but distinct goals: predicting the set of probable low-energy in-vivo structural conformations for proteins and RNA, and using this information to hypothesize structural protein-protein and protein-RNA interactions. To do this we construct energy functions based on biophysical properties, tune them using machine learning, and determine our predicted structure according to them using dynamic programming optimization. Current progress on this project can be found at http://protein.csail.mit.edu/.

Protein and RNA Structures

Proteins are the building blocks of life as we know it. These microscopic 3-dimensional structures inhabit living cells and serve as the machinery responsible for almost all molecular processes within an organism, such as metabolism, signal transduction, gene regulation, and cell replication. Despite the fact that proteins perform such critical functions, each protein is only composed of a single sequential chain of a few hundred amino acids that are folded back onto themselves into a 3-dimensional packing. It is this 3-dimensional shape of amino acid chains that ultimately determines the functionality of a protein.

RNA macromolecules are also essential to the basic functions of life. Classically, RNA was believed to be a chain of nucleic acids that transports information from DNA to ribosomes in a cell to allow for protein synthesis. However, some RNA has also been shown to form 3-dimensional structures that perform biological functions similar to those of proteins. Just as with proteins, the 3-dimensional structure of an RNA macromolecule determines its function in the cell and is composed of a sequential chain of nucleotides folded back onto themselves forming pairings between bases. Indeed, some believe that, before proteins evolved, RNA molecules handled the functions proteins perform today. Since the functionality of proteins and RNA is so important to life itself, determining the structure of every existing protein and RNA remains one of the core problems of modern biology.


Existing Approaches for Protein and RNA Structure Determination

Experimental evidence suggests that the 3-dimensional structure of a protein or RNA is entirely dependent on the corresponding amino acid or nucleotide base sequence. For example, when a protein is placed in a different environment than normal, it can unfold into a single long strand of amino acids. When the protein is returned to its original environment, it usually readopts its original 3-dimensional configuration.

For decades researchers have used this assumption to try to determine the 3-dimensional structure of a protein given only its amino acid sequence. The most successful approaches work by matching the sequence of a target protein whose structure we wish to determine to those of proteins whose structures are already known. To find these matches, one can search the Protein Data Bank (PDB), which contains over 40,000 structures that have been experimentally determined using X-ray crystallography or NMR spectroscopy. These homology approaches are based on the evolutionary assumption that high sequence similarity implies high functional and structural similarity. Although these techniques work well with some proteins, they are unable to determine the structure of proteins that are unlike any in the PDB and do not offer insights into the biochemical mechanisms that cause a protein to fold.

Work on RNA structure determination uses a similar assumption that the nucleotide base sequence of an RNA strand completely determines its 3-dimensional structure. Most approaches model the RNA chain as a collection of bases that can sometimes pair with each other for a fixed reduction or increase in energy, dependent on their location within the structure relative to nearby pairings. Dynamic programming techniques are then used to compute the set of pairings that sum to the lowest free energy. These methods can determine correct structures reasonably well but are possibly limited in their ability to model the important forces. Further, the energy bonuses of base-pairings have been derived from a limited set of experiments and may not represent the optimal assignment of values.

Our Approach

For both proteins and RNA, we have decided to tackle the problem of determining structure from sequence by transforming it into a straightforward optimization problem. Unlike most other approaches, we will focus on cost functions for which the global minimum can be found so that our results exclusively reflect the quality of the cost function, not the quality of the optimizer. Because we view everything as a flat optimization problem, our cost functions will allow us to capture many long-range interactions between molecules that more common methods ignore. Hidden Markov models, context free grammars, and LP formulations are all good candidates for cost function models that will be investigated. In fact, recent work [1,2] has shown that grammar-based approaches may be well suited to this problem, and we intend to extend these results to larger classes of proteins. Dynamic programming and linear constraint-based problem solving techniques will generally be employed to optimize these cost functions.

To improve our cost functions, we will not only consider experimentally derived energy and environmental parameters, but we will also use machine learning to determine the values of other biophysically-motivated parameters. This will enhance the richness of the cost functions we use and allow us to better model the underlying interactions that determine the structure of proteins and RNA. In particular, two techniques [3,4] that focus on structured output spaces are well adapted to our particular learning needs.


Information pertaining to our current work can be found at http://protein.csail.mit.edu. We have successfully applied our approach to the problem of determining the secondary structure of proteins that contain only alpha helices. Our method uses support vector hidden Markov models to determine the structure of a protein given only its sequence and is described in a poster [5] and a workshop report [6]. We have also successfully applied this method of learning to find a greatly reduced set of RNA base-pair interaction parameters for the Vienna RNA Package that perform equally and sometimes better than those that are experimentally determined.

Beyond helical prediction, we are interested in predicting residue contact interactions of proteins with regular β-sheet structures such as transmembrane β-barrel proteins. Inspired by earlier grammatical models for structure prediction [2], we are developing new model and algorithms for describing the set of biologically realizable structures. For this we are also building energy cost functions that rely on both atomistic force calculations and macroscopic experimental properties.

Concurrently, we are investigating efficient methods for studying protein-protein and protein-RNA interactions. These techniques sample from a predicted set of probable conformations and use fast structure-approximation algorithms to compare and score two interacting molecules according to their steric and biochemical constraints.


[1] J. Waldispühl and J-M. Steyaert. Modeling and Predicting all-Alpha Transmembrane Proteins Including Helix-Helix Pairing. Theoretical Computer Science.

[2] J. Waldispühl, B. Berger, P. Clote, and J-M. Steyaert. Predicting Transmembrane Beta-barrels and Inter-strand Residue Interactions from Sequence. In PROTEINS: Structure, Function and Bioinformatics, 65(1) pp. 61-74 (2006)

[3] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

[4] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning Structured Prediction Models: A Large Margin Approach. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.

[5] Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Learning Biophysically-Motivated Parameters for Alpha Helix Prediction. Poster at the 10th Annual International Conference on Research in Computational Molecular Biology, Venice Lido, Italy, 2006.

[6] Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Proceedings of the 2006 Workshop on Pattern Recognition in Bioinformatics, Volume 4146 of Lecture Notes in Computer Science, Hong Kong, China, 2006.

vertical line
vertical line
horizontal line

MIT logo Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu