
dc.contributor.advisor: Koehn, Philipp
dc.contributor.advisor: Osborne, Miles
dc.contributor.author: Hoang, Hieu
dc.date.accessioned: 2012-01-25T13:38:18Z
dc.date.available: 2012-01-25T13:38:18Z
dc.date.issued: 2011-11-24
dc.identifier.uri: http://hdl.handle.net/1842/5781
dc.description.abstract: Statistical machine translation (SMT) should benefit from linguistic information to improve performance, but current state-of-the-art systems rely purely on data-driven models. There are several reasons why prior efforts to build linguistically annotated models have failed or not even been attempted. Firstly, the practical implementation often requires too much work to be cost effective. Secondly, where ad-hoc implementations have been created, they impose constraints that are too strict to be of general use. Lastly, many linguistically-motivated approaches are language dependent, tackling peculiarities in certain languages that do not apply to other languages. This thesis successfully integrates linguistic information about part-of-speech tags, lemmas and phrase structure to improve MT quality. The major contributions of this thesis are: 1. We enhance the phrase-based model to incorporate linguistic information as additional factors in the word representation. The factored phrase-based model allows us to make use of different types of linguistic information in a systematic way within the predefined framework. We show how this model improves translation by as much as 0.9 BLEU for small German-English training corpora, and 0.2 BLEU for larger corpora. 2. We extend the factored model to the factored template model to focus on improving reordering. We show that by generalising translation with part-of-speech tags, we can improve performance by as much as 1.1 BLEU on a small French-English system. 3. Finally, we switch from the phrase-based model to a syntax-based model with the mixed syntax model. This allows us to transition from the word-level approaches using factors to multiword linguistic information such as syntactic labels and shallow tags. The mixed syntax model uses source language syntactic information to inform translation. We show that the model is able to explain translation better, leading to a 0.8 BLEU improvement over the baseline hierarchical phrase-based model for a small German-English task. Also, the model requires only labels on continuous source spans; it is not dependent on a tree structure, so other types of syntactic information can be integrated into the model. We experimented with a shallow parser and saw a gain of 0.5 BLEU for the same dataset. With more training data, we improve translation by 0.6 BLEU (1.3 BLEU out-of-domain) over the hierarchical baseline. During the development of these three models, we discovered that attempting to rigidly model translation as a linguistic transfer process results in degraded performance. However, by combining the advantages of standard SMT models with linguistically-motivated models, we are able to achieve better translation performance. Our work shows the importance of balancing the specificity of linguistic information with the robustness of simpler models. [en]
dc.contributor.sponsor: Engineering and Physical Sciences Research Council (EPSRC) [en]
dc.contributor.sponsor: European Union [en]
dc.contributor.sponsor: Defense Advanced Research Projects Agency (DARPA) [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.relation.hasversion: Hoang, H. and Koehn, P. (2008). Design of the Moses decoder for statistical machine translation. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 58–65, Columbus, Ohio. Association for Computational Linguistics. [en]
dc.relation.hasversion: Hoang, H. and Koehn, P. (2009). Improving mid-range re-ordering using templates of factors. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 372–379, Athens, Greece. Association for Computational Linguistics. [en]
dc.relation.hasversion: Hoang, H. and Koehn, P. (2010). Improved translation with source syntax labels. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 409–417, Uppsala, Sweden. Association for Computational Linguistics. [en]
dc.relation.hasversion: Hoang, H., Koehn, P., and Lopez, A. (2009). A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proc. of the International Workshop on Spoken Language Translation, pages 152–159, Tokyo, Japan. [en]
dc.relation.hasversion: Koehn, P., Federico, M., Shen, W., Bertoldi, N., Bojar, O., Callison-Burch, C., Cowan, B., Dyer, C., Hoang, H., Zens, R., Constantin, A., Moran, C. C., and Herbst, E. (2006). Open source toolkit for statistical machine translation. Technical report, Johns Hopkins University. [en]
dc.relation.hasversion: Koehn, P. and Hoang, H. (2007). Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876. [en]
dc.relation.hasversion: Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics. [en]
dc.subject: machine translation [en]
dc.subject: natural language [en]
dc.title: Improving statistical machine translation with linguistic information [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]

