MIT CSAIL Research Abstracts

Research Abstracts - 2007

A Discriminative Tree-to-Tree Translation Model for Chinese to English

Dan Wheeler, Brooke Cowan, Chao Wang & Michael Collins

Overview

State-of-the-art statistical machine translation techniques perform poorly on language pairs with drastically different syntax. Any translation scheme struggles more the less its two languages have in common, but we believe one crucial reason for this drop in performance is a near-complete lack of syntactic analysis. Current systems tend to do an excellent job translating content correctly, but for a hard language pair, particularly over longer sentences, the grammatical relations between content words and phrases rarely come out right. The resulting translations are often unreadable and, perhaps worse, misleading (a common example is accidentally swapping subject and object in "Alice said to Bob ...").

Severe grammatical discrepancies, including wildly differing word order, make Chinese to English one of the hardest language pairs to work with. The good news is that steady advances in parsing over the last decade have made syntactically motivated translation more viable. We plan to extend the discriminative tree-to-tree framework of Cowan et al. [1] to the Chinese->English language pair, with the conjecture that explicitly modeling grammar through syntactic analysis of both languages will lead to improved performance over current methods.

Tree-to-Tree Framework of Cowan et al.

Brooke Cowan, Ivona Kučerová and Michael Collins designed and implemented a tree-to-tree translation framework for German to English in 2006 [1], with performance similar to that of phrase-based systems [2]. At the highest level, translation proceeds sentence by sentence in three stages: splitting each source sentence into clauses; predicting, for each source clause independently, a syntactic structure in the target language called an Aligned Extended Projection (AEP); and, finally, linking the AEPs together and collapsing them to obtain the final target sentence. Verbal arguments (including subjects and objects) and adjunctive modifiers (such as PPs and ADJPs) are translated separately, currently using a phrase-based system [2]. The hope is to achieve better grammaticality through syntactic modelling at a high level while maintaining the same quality of content translation by applying current methods to reduced (verbless) expressions. The figure below shows a simplified view of the split and predict stages on a Chinese->English example. In practice, linking and collapsing are an easy final step.

Aligned Extended Projections build on the concept of Extended Projections in lexicalized tree adjoining grammars (LTAG) as described in [3], through the addition of alignment information based on work in synchronous LTAG [4]. Roughly speaking, an extended projection applies to a content word in a parse tree (such as a noun or verb), and consists of a tree fragment around the content word that includes its associated function words, such as complementizers, determiners, and prepositions. Importantly, an extended projection around a verb encapsulates that verb's argument structure: how the subject and object attach to the clausal tree fragment, and with what function words in between.

An aligned extended projection of a main verb is an EP in the target language that contains alignment information to the corresponding clause in the source language. Cowan's tree-to-tree framework provides methods for (1) extracting {source clause, target AEP} training pairs from a parallel treebank, and (2) training a discriminative feature-based model that predicts target AEPs from source clauses. A more detailed explanation, replete with examples, may be found in the original paper [1].

A few key properties are worth mentioning:

Extraction: Training data is conservatively extracted (high precision, low recall), meaning more data is thrown away but the resulting training set is of relatively higher quality. Specifically, for a sentence pair from the corpus to be selected, the clause split stage must have detected the same number of clauses for each language. As an additional check, a method that utilizes GIZA++ word alignments [5] is applied to align noun phrases and prepositional phrases in parallel sentences. Only clauses with consistent NP/PP alignments are used as training examples. It is important to be conservative because errors earlier in the pipeline can leave the parse trees themselves imperfect.
Training: The AEP prediction model is a linear history-based model trained with the averaged perceptron algorithm [6] and decoded using beam search. One strength of the approach is that feature sets can be developed with relative ease for new language pairs: while features in the original model exploited German/English dependencies, a new set will soon be custom-built for Chinese/English. Possible features include structural information about the source clause and target AEP under consideration, properties of the function words present (such as modal verbs, complementizers and wh-words), and, on the English side, inflectional properties of the main verb. An additional advantage of the feature-based approach is the flexibility with which qualitatively wide-ranging features can be integrated.
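
The two properties above can be illustrated with a minimal sketch; it is not the implementation of [1]. It assumes that an AEP is encoded as a short sequence of decisions (here a hypothetical two-step encoding: modal, then inflection), that the NP/PP alignment check has already been reduced to one boolean flag per clause, and that clauses are paired by position. The function names, the toy features and the romanized toy clauses are all invented for illustration. The update used is the plain perceptron update on the full predicted sequence, with weights averaged over all training steps.

```python
from collections import defaultdict

def select_pairs(corpus):
    """Conservative extraction over (source clauses, target decisions, flags)."""
    kept = []
    for src_clauses, tgt_decisions, consistent in corpus:
        if len(src_clauses) != len(tgt_decisions):
            continue  # clause counts differ: discard the whole sentence pair
        for clause, gold, ok in zip(src_clauses, tgt_decisions, consistent):
            if ok:  # keep only clauses whose NP/PP alignments were consistent
                kept.append((clause, gold))
    return kept

def features(clause, history, decision):
    """Toy features: the previous decision, plus each source word at this step."""
    prev = history[-1] if history else "<s>"
    step = len(history)
    return [("prev", prev, decision)] + [("word", w, step, decision) for w in clause]

def beam_decode(weights, clause, options, beam_size=4):
    """Predict one decision per step, keeping the beam_size best histories."""
    beam = [(0.0, ())]
    for candidates in options:
        expanded = []
        for total, history in beam:
            for d in candidates:
                gain = sum(weights.get(f, 0.0) for f in features(clause, history, d))
                expanded.append((total + gain, history + (d,)))
        expanded.sort(key=lambda item: -item[0])  # stable sort: ties keep order
        beam = expanded[:beam_size]
    return beam[0][1]

def feature_counts(clause, decisions):
    """Sum the feature vectors of every decision in a complete sequence."""
    counts = defaultdict(float)
    for i, d in enumerate(decisions):
        for f in features(clause, decisions[:i], d):
            counts[f] += 1.0
    return counts

def train(examples, options, epochs=5, beam_size=4):
    """Averaged perceptron: return the mean of the weights over all steps."""
    weights, totals, steps = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for clause, gold in examples:
            predicted = beam_decode(weights, clause, options, beam_size)
            if predicted != gold:  # plain perceptron update on a mistake
                for f, v in feature_counts(clause, gold).items():
                    weights[f] += v
                for f, v in feature_counts(clause, predicted).items():
                    weights[f] -= v
            for f, v in weights.items():  # accumulate for averaging
                totals[f] += v
            steps += 1
    return {f: v / steps for f, v in totals.items()}

# Toy data (hypothetical): romanized clauses that differ in a temporal modifier.
A = ("ta", "zuotian", "lai")
B = ("ta", "mingtian", "lai")
OPTIONS = [["none", "will"], ["past", "base"]]  # step 1: modal, step 2: inflection
corpus = [
    ([A], [("none", "past")], [True]),
    ([B], [("will", "base")], [True]),
    ([A, B], [("none", "past")], [True, True]),  # clause counts differ: dropped
    ([B], [("none", "base")], [False]),          # inconsistent alignment: dropped
]
examples = select_pairs(corpus)
model = train(examples, OPTIONS)
print([beam_decode(model, clause, OPTIONS) for clause, _ in examples])
```

Recomputing the running sum over every weight after each example is the simplest way to average and is only suitable for toy data; a real system would use a lazy update. New feature templates can be added to `features` without touching the decoder or the trainer, which is the flexibility the Training item refers to.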

Proposed Translator

Our training data consists of unprocessed newspaper articles from the government-run news agency of the People's Republic of China, translated in parallel to English. We plan to wrap the translation framework described above into an end-to-end system that can translate such articles. Because Chinese sentences do not contain spaces between words, tokenization must first be performed. The sentence is then parsed using a Chinese-tailored variant of the Collins parser [7]. After the tree-to-tree translation step for each clause, the resulting English AEPs are reassembled and flattened, resulting in an English output sentence.
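
Tokenization here means word segmentation. The segmenter for this project is not specified, so the sketch below substitutes a standard baseline, greedy forward maximum matching against a lexicon, purely as an illustration. The function name, the tiny lexicon and the `max_len` parameter are assumptions made for the example.

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical four-entry lexicon; the intended reading of the sentence is
# 研究 / 生命 / 起源 ("study the origin of life").
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))
```

On this input the greedy matcher commits to the longer entry 研究生 ("graduate student") and strands 命 as a single character, so the parser would receive a different word sequence from the intended one. Segmentation mistakes of this kind are made before parsing and clause splitting and cannot be undone by those later stages.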

Chinese-Specific Challenges

We're currently thinking about how to best cope with the following Chinese-specific issues.

Error propagation from tokenization: The extra processing step of tokenizing Chinese sentences into words creates a new layer of ambiguity and an early source for error that can easily derail later steps. It will therefore be important to add new sanity checks during data extraction.
Lack of inflectional morphology: The Chinese language family has no verb conjugation and no tense, gender or number agreement. Tense, for example, is established implicitly; for a sentence taken out of context, it can be determined only if the sentence contains a temporal modifier phrase (such as "last Saturday").
Subject/Object dropping: Once established in previous sentences, the subject or the object (sometimes both) may be dropped entirely. This makes sentence-by-sentence translation particularly ill-suited to Chinese. We're currently thinking about ways to use subject and object information from previous sentences in such cases, but it's a hard problem.
Rich idiomatic usage: Chinese formal writing is particularly full of condensed idioms, which often contain a subject and verb and yet should be translated into English as a single word. One 4-character idiom, for example, translates literally as "draw snake add legs", and yet in most sentences should be translated as the single word "unnecessary" or "redundant". The Chinese language contains thousands of such idioms.

In Essence

The goal of this project is twofold. We plan to:

Extend the tree-to-tree framework of Cowan et al. to handle Chinese-specific challenges.
Build an end-to-end Chinese to English translator, primarily for translating newspaper articles.

Support

This research is made possible through a sub-contract to BBN Technologies Corp. under DARPA grant HR0011-06-C-0022.

References:

[1] Brooke Cowan, Ivona Kučerová and Michael Collins. A discriminative model for tree-to-tree translation. In EMNLP 2006, 2006.

[2] Philipp Koehn, Franz J. Och and Daniel Marcu. Statistical phrase-based translation. In HLT/NAACL 03, 2003.

[3] Robert Frank. Phrase Structure Composition and Syntactic Dependencies. Cambridge, MA: MIT Press, 2002.

[4] Stuart M. Shieber and Yves Schabes. Synchronous tree-adjoining grammars. In Proceedings of the 13th International Conference on Computational Linguistics, 1990.

[5] Franz J. Och and Hermann Ney. A systematic comparison of various statistical alignment models. In Computational Linguistics, 29(1):19-51, 2003.

[6] Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. In ACL 04, 2004.

[7] Michael Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu