dc.contributor.advisor: Webber, Bonnie
dc.contributor.author: Oblau, Sarah
dc.date.accessioned: 2014-03-20T11:51:17Z
dc.date.available: 2014-03-20T11:51:17Z
dc.date.issued: 2012-11-28
dc.identifier.uri: http://hdl.handle.net/1842/8468
dc.description.abstract: Training the translation model in statistical machine translation requires a sentence-aligned parallel corpus of source and target language. Most available parallel corpora are at best document-aligned, so sentence alignment is performed on the document-aligned corpus as a pre-processing step before word alignment and construction of the phrase translation table. During sentence alignment, some data is discarded for "quality reasons", usually because of N:1 sentence alignments. This work presents a set of rules based on empirical analysis of discourse strategies in data discarded during the alignment of Europarl data. The rules split the long sentence in 2:1/1:2 sentence alignments, yielding two 1:1 sentence alignments that are added to the training data. I present three evaluation methods addressing split performance and applicability, as well as the impact of the recovered data on the translation table. The results show that the splits determined by the rules lead to more grammatical sentences on each side of the split than a proportionate split, and that a translation system trained with the additional data achieves small improvements in BLEU score over one trained without it. Findings also indicate that the rules are domain-specific to the Europarl corpus and produce poor sentence splits when applied to N:1 alignments of news report data. [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.subject: SMT [en]
dc.subject: discourse [en]
dc.subject: sentence alignment [en]
dc.subject: discourse connectives [en]
dc.title: Using Discourse Strategies to Improve Sentence Alignment in Statistical Machine Translation [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Masters [en]
dc.type.qualificationname: MSc Master of Science [en]
dcterms.accessRights: Restricted Access [en_US]

