
Research Abstracts 2007
Dependency Parsing Models via the Matrix-Tree Theorem

Terry Koo, Xavier Carreras, Amir Globerson & Michael Collins

Abstract

We study training algorithms for dependency parsing that use marginals derived from log-linear distributions over parse trees. Although the inside-outside algorithm yields these marginals for projective structures, it is not applicable to the non-projective case. We present a novel algorithm for calculating marginals of distributions over non-projective trees, via an adaptation of Kirchhoff's Matrix-Tree Theorem. We explore the use of such marginals in training log-linear models, and in exponentiated-gradient optimization of max-margin models. Empirical results in multilingual dependency parsing show that the proposed algorithms outperform the averaged perceptron.

Overview

Recent work has applied discriminative machine learning methods to the problem of natural language parsing. A key advantage of these methods is their ability to employ rich and flexible features. In particular, discriminative methods have been applied to dependency parsing, with excellent empirical results [1]. One advantage of dependency parsing over constituent parsing is that non-projective structures are easily represented, which is important for some languages such as Czech. McDonald et al. [1] observed that non-projective parses correspond to spanning trees over the words of a sentence. Based on this observation, they introduced an online learning algorithm for non-projective parsing that achieves good performance on non-projective languages. While online algorithms are easy to implement and scale well to large datasets, log-linear models [2,3] or max-margin methods [4] often provide better performance. The latter algorithms explicitly minimize the loss incurred when parsing the entire training corpus. We develop a framework for training such models for dependency parsing.
We specifically address the theoretical problems associated with the non-projective case, and show how they may be resolved using tools from combinatorial graph theory. Training algorithms for log-linear models typically rely on the computation of partition functions and marginals; these computations require summations over the space of possible structures. In the case of projective parsing, the partition function and marginals can be evaluated in O(n^3) time, where n is the sentence length, using an inside-outside style algorithm [5] for the Eisner [6] parser. However, this approach does not carry over to the non-projective case, which involves enumeration over the set of all spanning trees. As we show here, this enumeration problem can in fact be solved by adapting a well-known result in graph theory: Kirchhoff's Matrix-Tree Theorem [7]. A naive application of the theorem yields O(n^4) and O(n^5) algorithms for computation of the partition function and marginals, respectively. However, our adaptation finds the partition function and marginals in O(n^3) time using simple matrix determinant and inversion operations. The use of marginal computations is not confined to training log-linear models. We also show how the marginals obtained here suffice to train max-margin models for non-projective dependency parsing, via an Exponentiated Gradient (EG) algorithm [8]. The experiments presented here constitute the first application of this algorithm to a large-scale problem. We applied the marginal-based algorithms to six languages with varying degrees of non-projectivity, using datasets obtained from the CoNLL-X shared task [9]. Our experimental framework compared three training algorithms: log-linear models, max-margin models (using EG training), and the averaged perceptron. Each of these was applied to both projective and non-projective parsing. Our results demonstrate that marginal-based algorithms outperform the perceptron.
Furthermore, we find that non-projective training approaches typically yield performance that is as good as or better than projective training.

Three Inference Problems for Marginal-based Dependency Parsing

Focus on a single, fixed sentence, and assume that we are given a score vector θ that specifies a score θ(h,m) for every possible dependency in the sentence. Furthermore, assume that the scores are to be interpreted additively, so that the score for a tree y decomposes into a sum over the scores of its dependencies: s(y) = Σ_{(h,m) ∈ y} θ(h,m), where (h,m) indicates a head-modifier dependency with head index h and modifier index m. Note that in a typical linear model for dependency parsing, the scores would be defined in terms of inner products between a parameter vector w and feature vectors: θ(h,m) = w · f(x,h,m). Using the score vector, we define a globally-normalized log-linear probability model P(y; θ) = exp{s(y)} / Z(θ), where Z(θ) = Σ_y exp{s(y)} is the partition function, which normalizes the distribution. We also consider the marginals, which give the probability of any given dependency being active: μ(h,m; θ) = Σ_y P(y; θ) [[(h,m) ∈ y]]. We define the following three inference problems in the context of this probability model:

  Decoding: computation of y* = argmax_y s(y).
  Computation of the Partition Function: Z(θ).
  Computation of the Marginals: μ(h,m; θ) for every dependency (h,m).

Efficient solutions to these three inference problems are essential for our marginal-based approaches to dependency parsing. First, note that the decoding problem above is equivalent to the basic parsing task. In addition, efficient algorithms for recovering the partition function and marginals suffice to allow gradient-based optimization of globally-normalized log-linear models. Furthermore, efficient computation of the marginals suffices for Exponentiated Gradient [8] optimization of max-margin models.

Non-projective Parsing with the Matrix-Tree Theorem

There are well-known algorithms for efficiently solving all three inference problems in the case of projective dependency parsing. For non-projective parsing, efficient algorithms have only been proposed for the Decoding problem [1].
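Although exponential in sentence length, brute-force enumeration makes the three inference problems concrete. The following is a minimal sketch for a toy two-word sentence, with arbitrary illustrative scores; index 0 stands for the root symbol, and all names here are ours, not part of the original formulation:

```python
import itertools
import math

# Illustrative scores theta[(h, m)] for head h -> modifier m; 0 is the root.
theta = {(0, 1): 1.2, (0, 2): 0.4, (1, 2): 0.9, (2, 1): -0.3}

def trees(n):
    """Enumerate dependency trees over words 1..n as sets of (h, m) edges."""
    for choice in itertools.product(range(n + 1), repeat=n):
        heads = {m: h for m, h in zip(range(1, n + 1), choice)}
        if any(h == m for m, h in heads.items()):
            continue  # no self-attachment
        def reaches_root(m):
            seen = set()
            while m != 0:
                if m in seen:
                    return False  # cycle
                seen.add(m)
                m = heads[m]
            return True
        # a head assignment is a tree iff every word reaches the root
        if all(reaches_root(m) for m in heads):
            yield frozenset((h, m) for m, h in heads.items())

def score(y):
    return sum(theta[h, m] for h, m in y)

# Inference problem 1, Decoding: the highest-scoring tree.
best = max(trees(2), key=score)
# Inference problem 2, Partition function: Z = sum over trees of exp(s(y)).
Z = sum(math.exp(score(y)) for y in trees(2))
# Inference problem 3, Marginals: probability that dependency (1, 2) is active.
mu_12 = sum(math.exp(score(y)) for y in trees(2) if (1, 2) in y) / Z
```

The number of spanning trees grows super-exponentially with sentence length, which is why the efficient Matrix-Tree computation matters in practice.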
We show that an adaptation of the Matrix-Tree Theorem yields efficient algorithms for the other two inference problems. Consider a complete directed graph over the words of the sentence, and note that non-projective parses of the sentence correspond to spanning trees of this graph. Suppose that the edges of this digraph are weighted according to the following weighted adjacency matrix: A(h,m) = exp{θ(h,m)}. Then the product of the edge weights of a spanning tree is proportional to the probability of the corresponding non-projective parse in a globally-normalized log-linear model. Moreover, the constant of proportionality is exactly the partition function: Π_{(h,m) ∈ y} A(h,m) = exp{s(y)} = Z(θ) P(y; θ). Therefore, the sum of the products of edge weights over all spanning trees of this graph recovers the partition function: Z(θ) = Σ_y Π_{(h,m) ∈ y} A(h,m). The Laplacian of a graph is a matrix derived from the (possibly-weighted) adjacency matrix of that graph. Kirchhoff's Matrix-Tree Theorem relates the sum-product of all weighted spanning trees rooted at some fixed node to the determinant of a submatrix of the graph Laplacian. The partition function is given by the sum-product of all weighted spanning trees rooted at any node, so a straightforward application of the Matrix-Tree Theorem would involve a summation of n determinants, i.e., an O(n^4) algorithm. However, we find that the determinant of a modified Laplacian matrix yields the partition function directly, resulting in an O(n^3) algorithm. In order to recover the marginals efficiently, we apply two identities. First, for a globally-normalized log-linear model, the marginals are given by the partial derivatives of the log partition function: μ(h,m; θ) = ∂ log Z(θ) / ∂ θ(h,m). Second, the derivative of the log-determinant of a matrix has a simple form: ∂ log det(X) / ∂ X = (X⁻¹)ᵀ. These two identities can be combined via the chain rule, resulting in an algorithm for computing the marginals that is dominated by the matrix inversion operation. Therefore, the marginals can be obtained in O(n^3) time.

Extensions: Multiple Roots and Labels

We briefly mention two extensions to our basic method.
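The determinant-and-inverse computation can be sketched as follows, assuming NumPy. This sketch uses the classical directed Matrix-Tree construction with an explicit root symbol (so multiple root attachments are allowed); the exact modified Laplacian used in the paper may differ in details, and all function and variable names here are ours:

```python
import numpy as np

def matrix_tree_inference(theta_root, theta):
    """Partition function and dependency marginals for a log-linear
    distribution over non-projective trees, via the directed
    Matrix-Tree Theorem.

    theta_root[m] : score of attaching word m directly to the root.
    theta[h, m]   : score of the dependency with head h and modifier m.
    Returns (log_Z, mu_root, mu), where mu[h, m] is the marginal
    probability that dependency (h, m) appears in the tree.
    """
    n = len(theta_root)
    A = np.exp(theta)            # word-word edge weights A(h, m)
    np.fill_diagonal(A, 0.0)     # no self-attachment
    r = np.exp(theta_root)       # root-word edge weights

    # Laplacian over the words (root row and column already deleted):
    # L[m, m] = total weight entering m, off-diagonal L[h, m] = -A[h, m].
    L = -A.copy()
    np.fill_diagonal(L, r + A.sum(axis=0))

    # det(L) sums the edge-weight products of all spanning trees.
    sign, log_Z = np.linalg.slogdet(L)
    Linv = np.linalg.inv(L)

    # Marginals via d(log det L)/dL[i, j] = inv(L)[j, i] and the chain rule.
    mu_root = r * np.diag(Linv)
    mu = A * (np.diag(Linv)[None, :] - Linv.T)
    return log_Z, mu_root, mu
```

Both the determinant and the inverse cost O(n^3), so the partition function and all n^2 marginals are obtained together in O(n^3) time.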
First, in some dependency parsing tasks, the desired structures are forests rather than trees; i.e., more than one root node may be allowed. The existing approaches for single-root parsing, both projective and non-projective, can be extended to multiple-root parsing quite naturally. Second, dependency parsing often involves training and prediction of labeled dependencies, which are tuples of the form (h,m,l), where l is the index of a label. The three inference problems are then redefined to incorporate additional maximizations or summations over the set of possible labels. However, because the tree-scoring function completely decouples the labels, the unlabeled algorithms can continue to be applied, modulo some simple preprocessing. Specifically, for the decoding inference problem, it suffices to define θ(h,m) = max_l θ(h,m,l) and apply the unlabeled decoding algorithm to the resulting set of dependency scores. Analogously, for the partition function and marginals inference problems, defining θ(h,m) = log Σ_l exp{θ(h,m,l)} and applying the unlabeled algorithms again suffices.

Multilingual Dependency Parsing Experiments

We used our marginal-based algorithms to train dependency parsers (both projective and non-projective) for six different languages: Arabic, Dutch, Japanese, Slovene, Spanish, and Turkish. Training and test data for these languages were obtained from the CoNLL-X shared task. We evaluated two marginal-based approaches: log-linear models (with conjugate-gradient training) and max-margin models (with Exponentiated Gradient training). We compared the parsers trained by the marginal-based approaches to baseline parsers trained by the averaged perceptron; note that the averaged perceptron is known to be competitive on many tasks. Feature sets and data partitions were identical for all three training algorithms. Our final tests showed that in a cross-linguistic comparison, the marginal-based algorithms produced significantly better parsers than the averaged perceptron.
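The label preprocessing described in the extensions above can be sketched as follows. Here `theta_lab` is a hypothetical array of labeled scores indexed as (head, modifier, label), and the function names are ours:

```python
import numpy as np

def collapse_for_decoding(theta_lab):
    """Max over labels: the best labeled tree assigns each dependency it
    contains that dependency's best label, so unlabeled decoding applies."""
    return theta_lab.max(axis=2)

def collapse_for_sums(theta_lab):
    """Log-sum-exp over labels: summing over labeled trees factors into an
    independent sum over labels at each dependency, so the unlabeled
    partition-function and marginal algorithms apply unchanged."""
    m = theta_lab.max(axis=2)  # subtract the max for numerical stability
    return m + np.log(np.exp(theta_lab - m[:, :, None]).sum(axis=2))
```

After either collapse, the labeled problems reduce exactly to their unlabeled counterparts over the resulting dependency scores.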
Acknowledgements

This work was funded by a grant from the National Science Foundation (DMS-0434222) and a grant from NTT, Agmt. Dtd. 6/21/1998. In addition, A. Globerson is supported by a postdoctoral fellowship from the Rothschild Foundation - Yad Hanadiv.

References:

[1] R. McDonald, K. Crammer, and F. Pereira. Online large-margin training of dependency parsers. In Proceedings of ACL, 2005.
[2] M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. Estimators for stochastic unification-based grammars. In Proceedings of ACL, 1999.
[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 2001.
[4] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2004.
[5] J. Baker. Trainable grammars for speech recognition. In 97th Meeting of the Acoustical Society of America, 1979.
[6] J. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING, 1996.
[7] W. Tutte. Graph Theory. Addison-Wesley, 1984.
[8] P. Bartlett, M. Collins, B. Taskar, and D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. In NIPS, 2004.
[9] S. Buchholz and E. Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, 2006.

