Research Abstracts - 2007
Exponentiated Gradient Algorithms for Log-Linear Structured Prediction

Amir Globerson, Xavier Carreras, Terry Koo & Michael Collins

Abstract

Conditional log-linear models are a popular method for structured prediction. Efficient learning of the model parameters is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes an exponentiated gradient (EG) algorithm for training such models. EG is applied to the convex dual of the maximum likelihood objective; this results in both sequential and parallel update algorithms, where in the sequential algorithm parameters are updated in an online fashion. We provide a convergence proof for both algorithms, which also applies to max-margin learning, simplifying previous results on EG learning of max-margin models and leading to a tighter bound on the convergence rate for these models. Experimental results on a large-scale parsing task show that the proposed algorithm converges much faster than a conjugate-gradient approach, both in terms of optimization objective and test error.

Overview

Structured learning problems involve the prediction of objects such as sequences or parse trees. The set of possible objects may be exponentially large, but each object typically has rich internal structure. Several models that implement learning in this scenario have been proposed over the last few years [1,2]. The underlying idea in these methods is to classify each instance x into a structured label y via the rule y = argmax_y w · φ(x, y), where φ(x, y) is a feature-vector representation of the pair (x, y) and w is a vector of parameters. Conditional random fields (CRFs) [2] learn w by maximizing the conditional log-likelihood of the training data under the log-linear model p(y | x; w) ∝ exp{w · φ(x, y)}, while max-margin models [1] learn w by separating the correct label of each training example from the alternatives with a large margin.

In both CRFs and max-margin models, the learning task involves an optimization problem which is convex, and therefore does not suffer from problems with local minima. However, finding the optimal weight vector w may still be computationally intensive, especially for very large data sets with millions of data points. In the current paper, we propose a fast and efficient algorithm for optimizing log-linear models such as CRFs.

To highlight the difficulty we address, consider the common practice of optimizing the conditional log-likelihood of a CRF using conjugate-gradient algorithms or other methods such as L-BFGS [3]. Any update to the weight vector requires at least one pass over the data set, and typically more due to line-search calls. Since conjugate-gradient convergence typically requires several tens of iterations, if not hundreds, such an optimization scheme can turn out to be quite slow. It would seem advantageous to have an algorithm that updates the weight vector after every sample point, instead of after the entire data set. One example of such an algorithm is stochastic gradient descent and its variants [4].

A different approach, which we employ here, was first suggested by Jaakkola and Haussler [5]. It proceeds via the convex dual of the likelihood maximization problem. As in the dual of the Support Vector Machine (SVM) problem, the dual variables correspond to the data points: each training example contributes a distribution over its candidate labels. It is thus tempting to optimize the dual by updating the variables of one example at a time, in the spirit of SMO-style algorithms for SVMs [6] and of dual algorithms for kernel logistic regression [7]. The exponentiated gradient algorithm we describe performs updates of exactly this kind.

We provide a convergence proof of the algorithm, using a symmetrization of the KL divergence. Our proof is relatively simple, and applies both to the max-margin and log-linear settings. Furthermore, we show that convergence can be obtained for the max-margin case with a higher learning rate than that used in the proof of Bartlett et al. [9], giving a tighter bound on the rate of convergence. Finally, we compare our EG algorithm to a conjugate-gradient optimization on a standard multiclass learning problem and on a complex natural language parsing problem. In both settings we show that EG converges much faster than conjugate gradient, both in terms of objective function and classification error.
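Before turning to the primal and dual problems, a minimal sketch may help make the basic setup concrete: a single linear score w · φ(x, y) drives both the argmax prediction rule and the conditional distribution p(y | x; w). The feature map, the small enumerable label set, and the toy data below are illustrative assumptions (the paper's label sets are structured and exponentially large, and its feature maps are task-specific).

```python
import numpy as np

# Toy feature map: phi(x, y) conjoins the input vector x with a label index y.
# Real structured models use task-specific features over parts of (x, y).
def phi(x, y, num_labels):
    f = np.zeros(len(x) * num_labels)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def scores(w, x, num_labels):
    # One linear score w . phi(x, y) per candidate label y.
    return np.array([w @ phi(x, y, num_labels) for y in range(num_labels)])

def predict(w, x, num_labels):
    # y* = argmax_y  w . phi(x, y)
    return int(np.argmax(scores(w, x, num_labels)))

def conditional_probs(w, x, num_labels):
    # p(y | x; w)  propto  exp{ w . phi(x, y) }
    s = scores(w, x, num_labels)
    s -= s.max()                      # numerical stabilisation of the softmax
    p = np.exp(s)
    return p / p.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_labels, dim = 3, 4
    w = rng.normal(size=dim * num_labels)
    x = rng.normal(size=dim)
    print("predicted label:", predict(w, x, num_labels))
    print("p(y | x; w):    ", conditional_probs(w, x, num_labels))
```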
Primal and Dual Problems for Log-Linear Models

Consider a supervised learning setting where we are given training data {(x_i, y_i)}, i = 1, ..., n, where each x_i is an input and y_i is its correct label, drawn from a set Y_i of candidate structured labels. Regularized conditional log-likelihood estimation corresponds to the primal problem

   (P-LL)   min_w  (C/2) ‖w‖² − Σ_i log p(y_i | x_i; w),

where C > 0 is a regularization constant and p(y | x; w) ∝ exp{w · φ(x, y)} as above. As in SVMs, the dual of this problem has one variable u_{i,y} for every training example i and candidate label y ∈ Y_i, and the primal parameters can be recovered as aggregates of these duals:

   w(u) = (1/C) Σ_i Σ_y u_{i,y} ( φ(x_i, y_i) − φ(x_i, y) ).

Let Δ_i be the set of distributions over the labels for a given example i. The dual problem can be expressed as

   (D-LL)   min_u  Q(u)   subject to   u_i ∈ Δ_i for all i,

where the dual objective function is

   Q(u) = Σ_i Σ_y u_{i,y} log u_{i,y} + (C/2) ‖w(u)‖².

At the dual optimum u*, the distributions recover the model's conditional probabilities, u*_{i,y} = p(y | x_i; w(u*)), and w(u*) solves the primal problem P-LL.
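As a rough numerical companion to these definitions, the sketch below computes w(u) and Q(u) for label sets small enough to enumerate. The feature map phi, the constant C, and the random toy data are illustrative assumptions; for structured labels the sums over y are computed with dynamic programming rather than explicit enumeration.

```python
import numpy as np

def phi(x, y, num_labels):
    # Same toy label-conjoined features as in the earlier sketch.
    f = np.zeros(len(x) * num_labels)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def w_of_u(u, X, Y_gold, num_labels, C):
    # w(u) = (1/C) * sum_{i,y} u[i, y] * (phi(x_i, y_i) - phi(x_i, y))
    w = np.zeros(len(X[0]) * num_labels)
    for i, (x, y_gold) in enumerate(zip(X, Y_gold)):
        for y in range(num_labels):
            w += u[i, y] * (phi(x, y_gold, num_labels) - phi(x, y, num_labels))
    return w / C

def dual_objective(u, X, Y_gold, num_labels, C):
    # Q(u) = sum_{i,y} u[i, y] * log u[i, y]  +  (C/2) * ||w(u)||^2
    w = w_of_u(u, X, Y_gold, num_labels, C)
    return float(np.sum(u * np.log(u)) + 0.5 * C * w @ w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, num_labels, dim, C = 5, 3, 4, 1.0
    X = rng.normal(size=(n, dim))
    Y_gold = rng.integers(num_labels, size=n)
    u = np.full((n, num_labels), 1.0 / num_labels)   # uniform dual distributions
    print("Q(u) at uniform u:", dual_objective(u, X, Y_gold, num_labels, C))
```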
Exponentiated Gradient Algorithms

The method of exponentiated gradient (EG) updates was first introduced by Kivinen and Warmuth [8] and extended to structured learning by Bartlett et al. [9]. We briefly describe batch and online-like methods for performing EG updates on our dual optimization problem (cf. D-LL above). Given a current set of dual distributions u (initialized, for example, to the uniform distribution over each example's labels) and a learning rate η > 0, an EG update produces new distributions

   u'_{i,y} ∝ u_{i,y} exp{ −η ∇_{i,y} },

where ∇_{i,y} = ∂Q(u)/∂u_{i,y} = 1 + log u_{i,y} + w(u) · ( φ(x_i, y_i) − φ(x_i, y) ).

The batch and online-like EG updates are then simple to describe. In the batch setting, we fix a learning rate η and apply the update to the distributions of all examples simultaneously at every iteration. In the online setting, we consider each example i in turn, update only its distribution u_i, and update w(u) accordingly before moving on to the next example.
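The sketch below spells out both update schedules under the same toy, enumerable-label setup as the earlier sketches; phi, C, η, and the data are illustrative assumptions, and w(u) is recomputed from scratch for clarity rather than maintained incrementally as one would in practice.

```python
import numpy as np

def phi(x, y, num_labels):
    # Same toy label-conjoined features as in the earlier sketches.
    f = np.zeros(len(x) * num_labels)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def psi(x, y_gold, y, num_labels):
    # psi_{i,y} = phi(x_i, y_i) - phi(x_i, y)
    return phi(x, y_gold, num_labels) - phi(x, y, num_labels)

def w_of_u(u, X, Y_gold, num_labels, C):
    # w(u) = (1/C) * sum_{i,y} u[i, y] * psi_{i,y}
    w = np.zeros(len(X[0]) * num_labels)
    for i, (x, y_gold) in enumerate(zip(X, Y_gold)):
        for y in range(num_labels):
            w += u[i, y] * psi(x, y_gold, y, num_labels)
    return w / C

def eg_update_example(u_i, x, y_gold, w, num_labels, eta):
    # u'_{i,y}  propto  u_{i,y} * exp{ -eta * (1 + log u_{i,y} + w . psi_{i,y}) }
    grad = np.array([1.0 + np.log(u_i[y]) + w @ psi(x, y_gold, y, num_labels)
                     for y in range(num_labels)])
    new_u = u_i * np.exp(-eta * grad)
    return new_u / new_u.sum()

def eg_batch_step(u, X, Y_gold, num_labels, C, eta):
    # Batch: every example is updated against the same w(u), then u is replaced.
    w = w_of_u(u, X, Y_gold, num_labels, C)
    return np.array([eg_update_example(u[i], X[i], Y_gold[i], w, num_labels, eta)
                     for i in range(len(X))])

def eg_online_pass(u, X, Y_gold, num_labels, C, eta):
    # Online-like: one example at a time, refreshing w(u) after each update
    # (recomputed here for clarity; it can be maintained incrementally).
    u = u.copy()
    for i in range(len(X)):
        w = w_of_u(u, X, Y_gold, num_labels, C)
        u[i] = eg_update_example(u[i], X[i], Y_gold[i], w, num_labels, eta)
    return u

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, num_labels, dim, C, eta = 5, 3, 4, 1.0, 0.5
    X = rng.normal(size=(n, dim))
    Y_gold = rng.integers(num_labels, size=n)
    u = np.full((n, num_labels), 1.0 / num_labels)   # uniform initialization
    for _ in range(10):
        u = eg_batch_step(u, X, Y_gold, num_labels, C, eta)
    print("dual distributions after 10 batch EG steps:\n", u)
```

Starting from uniform distributions and iterating either schedule with a suitably small η drives the dual objective Q(u) of the previous sketch downward, and the resulting w(u) can then be used for prediction as in the first sketch.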
Convergence of the Batch and Online Updates

We briefly sketch our proof of convergence. The analysis relies on manipulations of three divergence measures between sets of distributions u and u': the two Kullback-Leibler divergences D(u ‖ u') and D(u' ‖ u), and their sum, the symmetrized KL divergence. Here, D(u ‖ u') = Σ_i Σ_y u_{i,y} log ( u_{i,y} / u'_{i,y} ). By the definition of the EG update, the decrease in the dual objective caused by an update can be bounded from below in terms of these divergences, so that every update strictly decreases Q(u) unless u is already optimal. This result allows us to deduce that both batch and online-like EG updates will eventually converge to the optimum of the dual problem D-LL. Furthermore, by applying an additional lemma from Kivinen and Warmuth [8], we can prove that the batch algorithm will converge to within any desired accuracy ε of the optimal dual objective, at a rate governed by the learning rate and by the divergence between the initial and optimal dual distributions.
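As a small practical companion to the proof sketch, the snippet below computes the two KL divergences and their symmetrized sum between successive dual iterates; tracking these quantities (or the dual objective itself) gives a simple, illustrative convergence monitor. The helper names and the stopping tolerance are assumptions, not the paper's criteria.

```python
import numpy as np

def kl(u, v, eps=1e-12):
    # D(u || v) = sum_{i,y} u[i, y] * log( u[i, y] / v[i, y] ),
    # summed over all examples and labels; eps guards against log(0).
    u = np.clip(u, eps, None)
    v = np.clip(v, eps, None)
    return float(np.sum(u * (np.log(u) - np.log(v))))

def symmetrized_kl(u, v):
    # The symmetrization D(u || v) + D(v || u) used in the proof sketch.
    return kl(u, v) + kl(v, u)

def has_converged(u_prev, u_next, tol=1e-6):
    # Illustrative stopping rule: stop when successive dual iterates barely move.
    return symmetrized_kl(u_prev, u_next) < tol
```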
Acknowledgements

This work was funded by a grant from the National Science Foundation (DMS-0434222) and a grant from NTT, Agmt. Dtd. 6/21/1998. In addition, A. Globerson is supported by a postdoctoral fellowship from the Rothschild Foundation - Yad Hanadiv.

References:

[1] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2004.
[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 2001.
[3] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.
[4] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of ICML, 2006.
[5] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of AISTATS, 1999.
[6] J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[7] S. Keerthi, K. Duan, S. Shevade, and A. N. Poo. A fast dual algorithm for kernel logistic regression. Machine Learning, 61:151-165, 2005.
[8] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1-63, 1997.
[9] P. Bartlett, M. Collins, B. Taskar, and D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. In NIPS, 2004.
[10] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, 2004.