CSAIL Research Abstract

Protein Structure Prediction Using Dynamic Programming and Structured Output Support Vector Machines

Blaise Gassend, Charles W. O'Donnell, William Thies, Christopher Batten, G. Edward Suh,
Dwaine Clarke, Marten van Dijk & Srinivas Devadas

The Protein Folding Problem

The protein folding problem is one of the core problems of modern biology. Proteins are one of the fundamental building blocks of life as we know it, as they make up most of the machinery that allows cells to operate. Each protein is a chain of amino-acids. There are twenty kinds of amino-acids, and each protein is characterized by a specific sequence of a few hundred of them. The process by which information encoded in the DNA is used by a cell to generate a protein with a specific sequence of amino-acids is very well understood, and is one of the major successes of molecular biology. However, understanding the sequence of amino acids is not sufficient for understanding the function of a protein. Indeed, the functional properties of a protein are related to the 3-D configuration of amino-acid chain. Experimental evidence suggests that the 3-D structure of a protein depends only on the sequence of amino-acids that make it up: when a protein is placed in an environment that is different from its native environment, it unfolds; when it is replaced in its native environment, it usually re-adopts its original 3-D configuration. The protein folding problem consists in figuring out from a protein's sequence what 3-D configuration it will adopt.

Existing Approaches

Decades of work have been put into the protein folding problem, but so far the results are unsatisfying. Every two years during the CASP experiment [4], groups worldwide put their algorithms into a competition in which they are tested on on proteins that are about to be resolved experimentally. The most successful approaches work by comparing the protein to be folded with proteins in the Protein Data Bank (PDB), a database that brings together all the protein structures that have been experimentally observed using X-ray crystallography. When similar sequences are found, similar structures are expected. These so-called homology approaches give good results on proteins with similarities to proteins for which the structure has already been resolved, but are unable to solve completely new families of proteins. They also fail to give much insight into the mechanisms by which proteins adopt their folded configuration. Atomistic approaches, by contrast model a protein as a system of atoms interacting via force fields. They then use molecular dynamics or Monte Carlo methods to try to find the lowest free energy state of the system of atoms. So far these methods are not fast enough to solve anything but the smallest proteins, but they are useful for refining structures that have been determined by other means.

Our Approach

We have decided to attack the protein folding problem as an optimization problem. We define a cost function over proteins and then optimize it to its global minimum. Unlike a lot of other work, we will focus on cost functions for which the global minimum can be found so that our results reflect exclusively the quality of the cost function, not the quality of the optimizer. We will generally be using dynamic programming algorithms to optimize the cost function, which allows us to capture many long range interactions that the commonly employed hidden Markov model approaches. Recent work [1] has shown that grammar/dynamic programming approaches can give good results in the case of transmembrane proteins. We hope to extend these results to larger classes of proteins. One way in which we hope to improve over [1] is to allow more richness in the cost function we use. In [1], the cost function was built to use experimentally determined indices of hydrophobicity. In our work, we would like more richness in the cost function than can be provided by experimentally measured parameters, so we will use learning methods to determine what values to use for each parameters. Machine learning on structured output spaces neatly applies to our problem. In particular, the learning method described in [2] should allow us to learn parameters for our optimization algorithm.

Progress

This project is just getting started. So far we have a basic code base in OCaml for accessing PDB files, manipulating and displaying structures, and for optimizing structures given a set of parameters. Our short term goal is to interface our code with SVM^light [3] to try to learn parameters for determining the structure of all-alpha proteins.

Future

Once we have seen how well our basic method performs on all-alpha proteins, we will try to extend it, first to beta-sheets, then to other elements of secondary structure. We also plan on using feature selection methods to learn which parameters to consider when training our algorithms. Hopefully we will be able to gain insight on how proteins fold by observing which parameters are important, and what values they take.

References

[1] J. Waldispühl and J-M. Steyaert. Modeling and Predicting all-Alpha Transmembrane Proteins Including Helix-Helix Pairing. To be published in Theoretical Computer Science.

[2] I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

[3] T. Joachims. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999.

[4] J. Mout, K. Fidelis, A. Zemla and T. Hubbard. Critical Assessment of Methods of Protein Structure Prediction. In PROTEINS: Structure, Function and Genetics, 53:334-339 (2003).