dc.contributor.advisor: King, Simon
dc.contributor.advisor: Renals, Stephen
dc.contributor.author: Merritt, Thomas
dc.date.accessioned: 2017-06-22T14:08:27Z
dc.date.available: 2017-06-22T14:08:27Z
dc.date.issued: 2017-07-07
dc.identifier.uri: http://hdl.handle.net/1842/22071
dc.description.abstract: When work on this thesis began, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely mistaken for natural speech. The literature offers many hypotheses for the causes of reduced synthesis quality in HMM speech synthesis, and for the improvements they would require. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However, this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, it remains unknown whether a real underlying issue or only a more minor one has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover the real underlying causes of reduced quality in HMM synthesis and the extent to which each contributes.
Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first follows from the finding that averaging across differing linguistic contexts is a major contributing factor to reduced synthesis quality. This averaging is typically performed during decision tree regression in HMM synthesis. Therefore, a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second follows from the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore, the hybrid synthesis paradigm is investigated. Hybrid systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis. [en]
dc.contributor.sponsor: Engineering and Physical Sciences Research Council (EPSRC) [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.relation.hasversion: Henter, G. E., Merritt, T., Shannon, M., Mayo, C., and King, S. (2014a). Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech. In Proc. Interspeech, number September, pages 1504–1508. [en]
dc.relation.hasversion: Henter, G. E., Merritt, T., Shannon, M., Mayo, C., and King, S. (2014b). Repeated Harvard Speech corpus version 0.5, [dataset]. University of Edinburgh, The Centre for Speech Technology Research (CSTR); Cambridge University Engineering Department. doi:10.7488/ds/39. [en]
dc.relation.hasversion: Merritt, T., Clark, R. A. J., Wu, Z., Yamagishi, J., and King, S. (2016). Deep neural network-guided unit selection synthesis. In Proc. ICASSP. [en]
dc.relation.hasversion: Merritt, T., Clark, R. A. J., Wu, Z., Yamagishi, J., and King, S. (2016). Listening test materials for “Deep neural network-guided unit selection synthesis”, 2016 [dataset]. University of Edinburgh, The Centre for Speech Technology Research (CSTR), doi:10.7488/ds/1313. [en]
dc.relation.hasversion: Merritt, T. and King, S. (2013). Investigating the shortcomings of HMM synthesis. In Proc. SSW, pages 165–170. [en]
dc.relation.hasversion: Merritt, T., Latorre, J., and King, S. (2015). Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech. In Proc. ICASSP. [en]
dc.relation.hasversion: Merritt, T., Raitio, T., and King, S. (2014). Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis. In Proc. Interspeech, number September, pages 1509–1513. [en]
dc.relation.hasversion: Merritt, T., Ronanki, S., Wu, Z., and Watts, O. (2016). The CSTR entry to the Blizzard Challenge 2016. In Proc. Blizzard Challenge workshop. [en]
dc.relation.hasversion: Merritt, T., Yamagishi, J., Wu, Z., Watts, O., and King, S. (2015). Deep neural network context embeddings for model selection in rich-context HMM synthesis. In Proc. Interspeech. [en]
dc.relation.hasversion: Merritt, T., Yamagishi, J., Wu, Z., Watts, O., and King, S. (2015). Listening test materials for “Deep neural network context embeddings for model selection in rich-context HMM synthesis”, 2015 [dataset]. University of Edinburgh, The Centre for Speech Technology Research (CSTR), doi:10.7488/ds/256. [en]
dc.relation.hasversion: Watts, O., Henter, G. E., Merritt, T., Wu, Z., and King, S. (2016). From HMMs to DNNs: Where do the improvements come from? In Proc. ICASSP. [en]
dc.relation.hasversion: Watts, O., Henter, G. E., Merritt, T., Wu, Z., and King, S. (2016). Listening test materials for “From HMMs to DNNs: Where do the improvements come from?”, 2016 [dataset]. University of Edinburgh, The Centre for Speech Technology Research (CSTR), doi:10.7488/ds/1316. [en]
dc.rights: Attribution-NonCommercial-ShareAlike 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [*]
dc.subject: speech synthesis [en]
dc.subject: statistical parametric speech synthesis [en]
dc.subject: SPSS [en]
dc.subject: unit selection [en]
dc.subject: hybrid [en]
dc.title: Overcoming the limitations of statistical parametric speech synthesis [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International