Protein Structure Prediction Using Dynamic Programming and Structured
Output Support Vector Machines
Blaise Gassend, Charles W. O'Donnell, William Thies, Christopher Batten,
G. Edward Suh,
Dwaine Clarke, Marten van Dijk & Srinivas Devadas
The Protein Folding Problem
The protein folding problem is one of the core problems of modern biology.
Proteins are one of the fundamental building blocks of life as we know
it, as they make up most of the machinery that allows cells to operate.
Each protein is a chain of amino-acids. There are twenty kinds of amino-acids,
and each protein is characterized by a specific sequence of a few hundred
of them. The process by which information encoded in the DNA is used by
a cell to generate a protein with a specific sequence of amino-acids is
very well understood, and is one of the major successes of molecular biology.
However, understanding the sequence of amino acids is not sufficient for
understanding the function of a protein. Indeed, the functional properties
of a protein are related to the 3-D configuration of amino-acid chain.
Experimental evidence suggests that the 3-D structure of a protein depends
only on the sequence of amino-acids that make it up: when a protein is
placed in an environment that is different from its native environment,
it unfolds; when it is replaced in its native environment, it usually
re-adopts its original 3-D configuration. The protein folding problem
consists in figuring out from a protein's sequence what 3-D configuration
it will adopt.
Existing Approaches
Decades of work have been put into the protein folding problem, but
so far the results are unsatisfying. Every two years during the CASP experiment
[4], groups worldwide put their algorithms into a competition
in which they are tested on on proteins that are about to be resolved
experimentally. The most successful approaches work by comparing the protein
to be folded with proteins in the Protein Data Bank (PDB), a database
that brings together all the protein structures that have been experimentally
observed using X-ray crystallography. When similar sequences are found,
similar structures are expected. These so-called homology approaches give
good results on proteins with similarities to proteins for which the structure
has already been resolved, but are unable to solve completely new families
of proteins. They also fail to give much insight into the mechanisms by
which proteins adopt their folded configuration. Atomistic approaches,
by contrast model a protein as a system of atoms interacting via force
fields. They then use molecular dynamics or Monte Carlo methods to try
to find the lowest free energy state of the system of atoms. So far these
methods are not fast enough to solve anything but the smallest proteins,
but they are useful for refining structures that have been determined
by other means.
Our Approach
We have decided to attack the protein folding problem as an optimization
problem. We define a cost function over proteins and then optimize it
to its global minimum. Unlike a lot of other work, we will focus on cost
functions for which the global minimum can be found so that our results
reflect exclusively the quality of the cost function, not the quality
of the optimizer. We will generally be using dynamic programming algorithms
to optimize the cost function, which allows us to capture many long range
interactions that the commonly employed hidden Markov model approaches.
Recent work [1] has shown that grammar/dynamic programming
approaches can give good results in the case of transmembrane proteins.
We hope to extend these results to larger classes of proteins. One way
in which we hope to improve over [1] is to allow more
richness in the cost function we use. In [1], the cost
function was built to use experimentally determined indices of hydrophobicity.
In our work, we would like more richness in the cost function than can
be provided by experimentally measured parameters, so we will use learning
methods to determine what values to use for each parameters. Machine learning
on structured output spaces neatly applies to our problem. In particular,
the learning method described in [2] should allow us
to learn parameters for our optimization algorithm.
Progress
This project is just getting started. So far we have a basic code base
in OCaml for accessing PDB files, manipulating
and displaying structures, and for optimizing structures given a set of
parameters. Our short term goal is to interface our code with
SVMlight [3] to try to learn parameters
for determining the structure of all-alpha proteins.
Future
Once we have seen how well our basic method performs on all-alpha proteins,
we will try to extend it, first to beta-sheets, then to other elements
of secondary structure. We also plan on using feature selection methods
to learn which parameters to consider when training our algorithms. Hopefully
we will be able to gain insight on how proteins fold by observing which
parameters are important, and what values they take.
References
[1] J. Waldispühl and J-M. Steyaert. Modeling and Predicting
all-Alpha Transmembrane Proteins Including Helix-Helix Pairing. To be
published in Theoretical Computer Science.
[2] I. Tsochantaridis, T. Hofmann, T. Joachims
and Y. Altun. Support Vector Machine Learning for Interdependent and Structured
Output Spaces. In Proceedings of the 21st International Conference
on Machine Learning, Banff, Canada, 2004.
[3] T. Joachims. Making large-Scale SVM Learning Practical.
Advances in Kernel Methods - Support Vector Learning, B. Schölkopf
and C. Burges and A. Smola (ed.), MIT-Press, 1999.
[4] J. Mout, K. Fidelis, A. Zemla and T. Hubbard. Critical
Assessment of Methods of Protein Structure Prediction. In PROTEINS:
Structure, Function and Genetics, 53:334-339 (2003).
|