HMM-based speech synthesis using an acoustic glottal source model
Parametric speech synthesis has received increased attention in recent years following the development of statistical HMM-based speech synthesis. However, the speech produced using this method still does not sound as natural as human speech, and there is limited parametric flexibility to replicate aspects of voice quality, such as breathiness. The hypothesis of this thesis is that speech naturalness and voice quality can be more accurately replicated by an HMM-based speech synthesiser that uses an acoustic glottal source model, the Liljencrants-Fant (LF) model, to represent the source component of speech instead of the traditional impulse train. Two different analysis-synthesis methods were developed in this thesis in order to integrate the LF-model into a baseline HMM-based speech synthesiser, which is based on the popular HTS system and uses the STRAIGHT vocoder. The first method, called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model signal through a glottal post-filter to obtain the source signal, and then generating speech by passing this source signal through the spectral envelope filter. The system which uses the GPF method (HTS-GPF system) is similar to the baseline system, but it uses this source signal in place of the impulse train used by STRAIGHT. The second method, called Glottal Spectral Separation (GSS), generates speech by passing the LF-model signal through the vocal tract filter. The major advantage of the synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic properties of the LF-model parameters are automatically learnt by the HMMs. In this thesis, an initial perceptual experiment was conducted to compare the LF-model to the impulse train. The results showed that the LF-model was significantly better, both in terms of speech naturalness and replication of two basic voice qualities (breathy and tense).
In a second perceptual evaluation, the HTS-LF system was better than the baseline system, although the difference between the two was smaller than had been expected. A third experiment was conducted to evaluate the HTS-GPF system and an improved HTS-LF system in terms of speech naturalness, voice similarity and intelligibility. The results showed that the HTS-GPF system performed similarly to the baseline. However, the HTS-LF system was significantly outperformed by the baseline. Finally, acoustic measurements were performed on the synthetic speech to investigate the speech distortion in the HTS-LF system. The results indicated that the most significant cause of speech distortion is a difficulty in replicating the rapid variations of the vocal tract filter parameters at transitions between voiced and unvoiced sounds. Addressing this problem is left as future work to further improve the system.
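As a rough illustration of the source-filter idea behind both methods, the Python sketch below excites the same filter first with an impulse train and then with a glottal-pulse signal. It rests on several assumptions that are not from the thesis: a Rosenberg-style pulse is used as a simplified stand-in for the LF-model, a single two-pole resonator stands in for the spectral envelope or vocal tract filter, and all function names and parameter values are hypothetical. It does not implement the GPF post-filter or the GSS separation step.

```python
import math

def impulse_train(n_samples, period):
    """Excitation with a unit impulse at the start of every pitch period."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def rosenberg_pulse(period, open_frac=0.6, speed_frac=0.7):
    """One period of a Rosenberg-style glottal flow pulse.

    Assumption: this is a simplified stand-in, NOT the LF-model.
    """
    n_open = int(period * open_frac)   # samples in the open phase
    n_rise = int(n_open * speed_frac)  # samples in the opening part
    n_fall = n_open - n_rise           # samples in the closing part
    pulse = []
    for i in range(period):
        if i < n_rise:
            pulse.append(0.5 * (1.0 - math.cos(math.pi * i / n_rise)))
        elif i < n_open:
            pulse.append(math.cos(0.5 * math.pi * (i - n_rise) / n_fall))
        else:
            pulse.append(0.0)          # closed phase
    return pulse

def glottal_train(n_samples, period):
    """Periodic glottal excitation built from the stand-in pulse."""
    pulse = rosenberg_pulse(period)
    flow = [pulse[i % period] for i in range(n_samples)]
    # The first difference approximates the glottal flow derivative,
    # which is the quantity the LF-model describes.
    return [flow[i] - flow[i - 1] if i > 0 else 0.0 for i in range(n_samples)]

def resonator(freq_hz, radius, fs):
    """Coefficients [a1, a2] of a two-pole resonance (hypothetical filter)."""
    theta = 2.0 * math.pi * freq_hz / fs
    return [-2.0 * radius * math.cos(theta), radius * radius]

def all_pole_filter(excitation, a):
    """All-pole filtering: y[n] = x[n] - sum_k a[k] * y[n - k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# Illustrative values only: 8 kHz sampling, 100 Hz pitch, one 500 Hz resonance.
fs, period, n = 8000, 80, 400
a = resonator(500.0, 0.95, fs)
speech_impulse = all_pole_filter(impulse_train(n, period), a)
speech_glottal = all_pole_filter(glottal_train(n, period), a)
```

Only the excitation differs between `speech_impulse` and `speech_glottal`; the filter is shared, which mirrors how the systems described above keep the filtering stage and change the source signal.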