Syntactic and semantic features for statistical and neural machine translation
MetadataShow full item record
Machine Translation (MT) for language pairs with long distance dependencies and word reordering, such as German–English, is prone to producing output that is lexically or syntactically incoherent. Statistical MT (SMT) models used explicit or latent syntax to improve reordering, however failed at capturing other long distance dependencies. This thesis explores how explicit sentence-level syntactic information can improve translation for such complex linguistic phenomena. In particular, we work at the level of the syntactic-semantic interface with representations conveying the predicate-argument structures. These are essential to preserving semantics in translation and SMT systems have long struggled to model them. String-to-tree SMT systems use explicit target syntax to handle long-distance reordering, but make strong independence assumptions which lead to inconsistent lexical choices. To address this, we propose a Selectional Preferences feature which models the semantic affinities between target predicates and their argument fillers using the target dependency relations available in the decoder. We found that our feature is not effective in a string-to-tree system for German→English and that often the conditioning context is wrong because of mistranslated verbs. To improve verb translation, we proposed a Neural Verb Lexicon Model (NVLM) incorporating sentence-level syntactic context from the source which carries relevant semantic information for verb disambiguation. When used as an extra feature for re-ranking the output of a German→ English string-to-tree system, the NVLM improved verb translation precision by up to 2.7% and recall by up to 7.4%. While the NVLM improved some aspects of translation, other syntactic and lexical inconsistencies are not being addressed by a linear combination of independent models. In contrast to SMT, neural machine translation (NMT) avoids strong independence assumptions thus generating more fluent translations and capturing some long-distance dependencies. Still, incorporating additional linguistic information can improve translation quality. We proposed a method for tightly coupling target words and syntax in the NMT decoder. To represent syntax explicitly, we used CCG supertags, which encode subcategorization information, capturing long distance dependencies and attachments. Our method improved translation quality on several difficult linguistic constructs, including prepositional phrases which are the most frequent type of predicate arguments. These improvements over a strong baseline NMT system were consistent across two language pairs: 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English.