Experimental evidence suggests that the 3-dimensional structure of a protein or RNA is entirely dependent on the corresponding amino acid or nucleotide base sequence. For example, when a protein is placed in a different environment than normal, it can unfold into a single long strand of amino acids. When the protein is returned to its original environment, it usually readopts its original 3-dimensional configuration.
For decades researchers have used this assumption to try to determine the 3-dimensional structure of a protein given only its amino acid sequence. The most successful approaches work by matching the sequence of a target protein whose structure we wish to determine to those of proteins whose structures are already known. To find these matches, one can search the Protein Data Bank (PDB), which contains over 40,000 structures that have been experimentally determined using X-ray crystallography or NMR spectroscopy. These homology approaches are based on the evolutionary assumption that high sequence similarity implies high functional and structural similarity. Although these techniques work well with some proteins, they are unable to determine the structure of proteins that are unlike any in the PDB and do not offer insights into the biochemical mechanisms that cause a protein to fold.
Work on RNA structure determination uses a similar assumption that the nucleotide base sequence of an RNA strand completely determines its 3-dimensional structure. Most approaches model the RNA chain as a collection of bases that can sometimes pair with each other for a fixed reduction or increase in energy, dependent on their location within the structure relative to nearby pairings. Dynamic programming techniques are then used to compute the set of pairings that sum to the lowest free energy. These methods can determine correct structures reasonably well but are possibly limited in their ability to model the important forces. Further, the energy bonuses of base-pairings have been derived from a limited set of experiments and may not represent the optimal assignment of values.
For both proteins and RNA, we have decided to tackle the problem of determining structure from sequence by transforming it into a straightforward optimization problem. Unlike most other approaches, we will focus on cost functions for which the global minimum can be found so that our results exclusively reflect the quality of the cost function, not the quality of the optimizer. Because we view everything as a flat optimization problem, our cost functions will allow us to capture many long-range interactions between molecules that more common methods ignore. Hidden Markov models, context free grammars, and LP formulations are all good candidates for cost function models that will be investigated. In fact, recent work [1,2] has shown that grammar-based approaches may be well suited to this problem, and we intend to extend these results to larger classes of proteins. Dynamic programming and linear constraint-based problem solving techniques will generally be employed to optimize these cost functions.
To improve our cost functions, we will not only consider experimentally derived energy and environmental parameters, but we will also use machine learning to determine the values of other biophysically-motivated parameters. This will enhance the richness of the cost functions we use and allow us to better model the underlying interactions that determine the structure of proteins and RNA. In particular, two techniques [3,4] that focus on structured output spaces are well adapted to our particular learning needs.
Information pertaining to our current work can be found at http://protein.csail.mit.edu. We have successfully applied our approach to the problem of determining the secondary structure of proteins that contain only alpha helices. Our method uses support vector hidden Markov models to determine the structure of a protein given only its sequence and is described in a poster  and a workshop report . We have also successfully applied this method of learning to find a greatly reduced set of RNA base-pair interaction parameters for the Vienna RNA Package that perform equally and sometimes better than those that are experimentally determined.
Beyond helical prediction, we are interested in predicting residue contact interactions of proteins with regular β-sheet structures such as transmembrane β-barrel proteins. Inspired by earlier grammatical models for structure prediction , we are developing new model and algorithms for describing the set of biologically realizable structures. For this we are also building energy cost functions that rely on both atomistic force calculations and macroscopic experimental properties.
Concurrently, we are investigating efficient methods for studying protein-protein and protein-RNA interactions. These techniques sample from a predicted set of probable conformations and use fast structure-approximation algorithms to compare and score two interacting molecules according to their steric and biochemical constraints.
 J. Waldispühl and J-M. Steyaert. Modeling and Predicting all-Alpha Transmembrane Proteins Including Helix-Helix Pairing. Theoretical Computer Science.
 J. Waldispühl, B. Berger, P. Clote, and J-M. Steyaert. Predicting Transmembrane Beta-barrels and Inter-strand Residue Interactions from Sequence. In PROTEINS: Structure, Function and Bioinformatics, 65(1) pp. 61-74 (2006)
 I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
 B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning Structured Prediction Models: A Large Margin Approach. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.
 Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Learning Biophysically-Motivated Parameters for Alpha Helix Prediction. Poster at the 10th Annual International Conference on Research in Computational Molecular Biology, Venice Lido, Italy, 2006.
 Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Proceedings of the 2006 Workshop on Pattern Recognition in Bioinformatics, Volume 4146 of Lecture Notes in Computer Science, Hong Kong, China, 2006.
Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - email@example.com