Please use this identifier to cite or link to this item:

Files in This Item:

File: Le Zhang PhD thesis 09.pdf
Size: 1.51 MB
Format: Adobe PDF
Title: Modelling Speech Dynamics with Trajectory-HMMs
Authors: Zhang, Le
Supervisor(s): Renals, Steve
Issue Date: 2009
Publisher: The University of Edinburgh
Abstract: The conditional independence assumption imposed by hidden Markov models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending first- and second-order regression coefficients to the observation feature vectors. Although this improves performance in recognition tasks, we argue that a straightforward use of dynamic features in HMMs results in an inferior model, because the dynamic constraints are handled incorrectly. In this thesis I show that an HMM can be transformed into a Trajectory-HMM capable of generating smoothed output mean trajectories by performing a per-utterance normalisation. The resulting model can be trained either by maximising the model log-likelihood or by minimising the mean generation error on the training data. To combat the exponential growth of paths during search, the idea of delayed path merging is proposed, and a new time-synchronous decoding algorithm built on the concept of token passing is designed for use in the recognition task. The Trajectory-HMM offers a new way of sharing knowledge between speech recognition and synthesis components by tackling both problems in a coherent statistical framework.
I evaluated the Trajectory-HMM on two speech tasks using the speaker-dependent MOCHA-TIMIT database. In the first, it served as a generative model to recover articulatory features from the speech signal, where the Trajectory-HMM was used in a complementary way to conventional HMM modelling techniques within a joint acoustic-articulatory framework. Experiments indicate that jointly trained acoustic-articulatory models are more accurate (having a lower root mean square (RMS) error) than separately trained ones, and that Trajectory-HMM training is more accurate than conventional Baum-Welch parameter updating. In addition, the RMS training objective proves consistently better than the maximum likelihood objective. However, experiments on the second task, phone recognition, show that the Trajectory-HMM trained by maximum likelihood estimation (MLE), while retaining the attractive property of being a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses and does not perform better than a conventional HMM. We use this to argue that models giving a better fit on training data may suffer reduced discrimination by being too faithful to the training data. Finally, experiments with triphone models show that increasing modelling detail is an effective way to improve modelling performance with little added complexity in training.
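The abstract's notion of a smoothed mean trajectory obtained from static and dynamic features can be illustrated with a small sketch. It is not taken from the thesis: it assumes a one-dimensional feature, a single first-order delta window of the form 0.5 * (c[t+1] - c[t-1]) with clamped edges, and diagonal variances, and the function names (`build_window_matrix`, `mean_trajectory`, `rms_error`) are hypothetical. Under those assumptions, the static trajectory c that best agrees with the per-frame static and delta means solves the weighted least-squares system (W' P W) c = W' P mu, where W maps static features to stacked static and delta features and P is the diagonal precision matrix.

```python
import numpy as np


def build_window_matrix(num_frames):
    """Stack a static (identity) block on top of a first-order delta block.

    Assumed delta window: delta[t] = 0.5 * (c[t+1] - c[t-1]), with the
    edge frames clamped (c[-1] = c[0] and c[T] = c[T-1]).
    """
    T = num_frames
    static = np.eye(T)
    delta = np.zeros((T, T))
    for t in range(T):
        delta[t, min(t + 1, T - 1)] += 0.5
        delta[t, max(t - 1, 0)] -= 0.5
    return np.vstack([static, delta])  # shape (2T, T)


def mean_trajectory(mu, var):
    """Solve (W' P W) c = W' P mu for the static trajectory c.

    mu and var have length 2T: the first T entries are the per-frame
    static means/variances, the last T entries the delta ones.
    P is the diagonal precision matrix diag(1 / var).
    """
    T = mu.shape[0] // 2
    W = build_window_matrix(T)
    precision = 1.0 / var
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)


def rms_error(predicted, reference):
    """Root mean square error between two equal-length trajectories."""
    diff = np.asarray(predicted) - np.asarray(reference)
    return float(np.sqrt(np.mean(diff ** 2)))


# A step in the static means with zero delta means: the delta constraint
# pulls the generated trajectory towards a smoother transition.
static_means = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
mu = np.concatenate([static_means, np.zeros(6)])
var = np.ones(12)
trajectory = mean_trajectory(mu, var)
```

In this sketch the piecewise-constant static means play the role of the state-level HMM output, and `trajectory` is the smoothed sequence that also respects the delta statistics. Making the delta variances very large removes the dynamic constraint, and the trajectory then falls back to the static means.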
Description: Institute for Communicating and Collaborative Systems
Keywords: Informatics; Institute for Communicating and Collaborative Systems; Speech Technology Research
Appears in Collections: Informatics thesis and dissertation collection

Items in ERA are protected by copyright, with all rights reserved, unless otherwise indicated.


Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh 2013, and/or the original authors.