CSAIL Publications and Digital Archive header
bullet Technical Reports bullet Work Products bullet Research Abstracts bullet Historical Collections bullet

link to publications.csail.mit.edu link to www.csail.mit.edu horizontal line


Research Abstracts - 2006
horizontal line

horizontal line

vertical line
vertical line

Protein and RNA Structure Prediction Using Energy Minimization and Machine Learning

Charles W. O'Donnell, Jason Eisenberg, Blaise Gassend, William Thies, Jerome Waldispühl, Marten van Dijk & Srinivas Devadas

Protein and RNA Structures

Proteins are the building blocks of life as we know it. These microscopic 3-dimensional structures inhabit living cells and serve as the machinery responsible for almost all molecular processes within an organism, such as metabolism, signal transduction, gene regulation, and cell replication. Despite the fact that proteins perform such critical functions, each protein is only composed of a single sequential chain of a few hundred amino acids that are folded back onto themselves into a 3-dimensional packing. It is this 3-dimensional shape of amino acid chains that ultimately determines the functionality of a protein.

RNA macromolecules are also essential to the basic functions of life. Classically, RNA was believed to be a chain of nucleic acids that transports information from DNA to ribosomes in a cell to allow for protein synthesis. However, some RNA has also been shown to form 3-dimensional structures that perform biological functions similar to those of proteins. Just as with proteins, the 3-dimensional structure of an RNA macromolecule determines its function in the cell and is composed of a sequential chain of nucleotides folded back onto themselves forming pairings between bases.

Since the functionality of proteins and RNA is so important to life itself, determining the structure of every existing protein and RNA remains one of the core problems of modern biology.

Existing Approaches for Protein and RNA Structure Determination

Experimental evidence suggests that the 3-dimensional structure of a protein or RNA is entirely dependent on the corresponding amino acid or nucleotide base sequence. For example, when a protein is placed in a different environment than normal, it can unfold into a single long strand of amino acids. When the protein is returned to its original environment, it usually readopts its original 3-dimensional configuration.

For decades researchers have used this assumption to try to determine the 3-dimensional structure of a protein given only its amino acid sequence. The most successful approaches work by matching the sequence of a target protein whose structure we wish to determine to those of proteins whose structures are already known. To find these matches, one can search the Protein Data Bank (PDB), which contains over 35,000 structures that have been experimentally determined using X-ray crystallography or NMR spectroscopy. These homology approaches are based on the evolutionary assumption that high sequence similarity implies high functional and structural similarity. Although these techniques work well with some proteins, they are unable to determine the structure of proteins that are unlike any in the PDB and do not offer insights into the biochemical mechanisms that cause a protein to fold.

In contrast, atomistic approaches model a protein as a system of atoms with interacting force fields. They use molecular dynamics simulations or Monte Carlo and simulated annealing techniques to try to find the lowest free energy state of the system. Atomistic methods can give useful insights into how a protein folds, but they are currently not fast enough to be applied to proteins of reasonable sizes. However, they are still a useful approach for refining structures that have been determined by other means.

Work on RNA structure determination uses a similar assumption that the nucleotide base sequence of an RNA strand completely determines its 3-dimensional structure. Most approaches model the RNA chain as a collection of bases that can sometimes pair with each other for a fixed reduction or increase in energy, dependent on their location within the structure relative to nearby pairings. Dynamic programming techniques are then used to compute the set of pairings that sum to the lowest free energy. These methods can determine correct structures reasonably well but are possibly limited in their ability to model the important forces. Further, the energy bonuses of base-pairings have been derived from a limited set of experiments and may not represent the optimal assignment of values.

Our Approach

For both proteins and RNA, we have decided to tackle the problem of determining structure from sequence by transforming it into a straightforward optimization problem. Unlike most other approaches, we will focus on cost functions for which the global minimum can be found so that our results exclusively reflect the quality of the cost function, not the quality of the optimizer. Because we view everything as a flat optimization problem, our cost functions will allow us to capture many long-range interactions between molecules that more common methods ignore. Hidden Markov models, context free grammars, and LP formulations are all good candidates for cost function models that will be investigated. In fact, recent work [1,2] has shown that grammar-based approaches may be well suited to this problem, and we intend to extend these results to larger classes of proteins. Dynamic programming and linear constraint-based problem solving techniques will generally be employed to optimize these cost functions.

To improve our cost functions, we will not only consider experimentally derived energy and environmental parameters, but we will also use machine learning to determine the values of other biophysically-motivated parameters. This will enhance the richness of the cost functions we use and allow us to better model the underlying interactions that determine the structure of proteins and RNA. In particular, two techniques [3,4] that focus on structured output spaces appear well adapted to our particular learning needs.


Information pertaining to our current work can be found at http://protein.csail.mit.edu. Thus far, we have successfully applied our approach to the problem of determining the secondary structure of proteins that only contain alpha helices. Our method uses support vector hidden Markov models to determine the structure of a protein given only its sequence and is described in a poster [5] and an accompanying technical report [6]. Finally, we have been working with the RNA structure determination algorithm from the Vienna RNA Package.


For protein structure determination, we are currently working to extend our models to be able to predict beta-strand secondary structures, protein residue contact pairings, and other elements of protein structure. We are also exploring other machine learning techniques and plan to incorporate environmental parameters into our cost functions. Our RNA structure determination algorithm is being similarly fine-tuned.

This summer also marks the seventh biennial CASP experiment [7]. In this worldwide competition, hundreds of groups put their best structure determination algorithms to the test by predicting the structures of new protein sequences that are about to be resolved experimentally. At the end of the competition, all of the submitted structures are compared to see who was best able to predict the structure of the newly determined protein structures. We will be tracking this competition to see how well our own algorithms compare.


[1] J. Waldispühl and J-M. Steyaert. Modeling and Predicting all-Alpha Transmembrane Proteins Including Helix-Helix Pairing. Theoretical Computer Science.

[2] J. Waldispühl, B. Berger, P. Clote, and J-M. Steyaert. Predicting Transmembrane Beta-barrels and Inter-strand Residue Interactions from Sequence. To appear in PROTEINS, 2006.

[3] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

[4] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning Structured Prediction Models: A Large Margin Approach. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.

[5] Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Learning Biophysically-Motivated Parameters for Alpha Helix Prediction. Poster at the 10th Annual International Conference on Research in Computational Molecular Biology, Venice Lido, Italy, 2006.

[6] Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Secondary Structure Prediction of All-Helical Proteins Using Hidden Markov Support Vector Machines MIT CSAIL Technical Report 1003 (MIT-LCS-TR-1003)

[7] J. Moult, K. Fidelis, B. Rost, T. Hubbard, A. Tramontano. Critical Assessment of Methods of Protein Structure Prediction (CASP) - Round 6. In PROTEINS: Structure, Function and Genetics, 61(S7), p 3-7 (2005).

vertical line
vertical line
horizontal line

MIT logo Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu