Please use this identifier to cite or link to this item:
http://hdl.handle.net/1842/3213
| Title: | Modelling Speech Dynamics with Trajectory-HMMs |
| Authors: | Zhang, Le |
| Supervisor(s): | Renals, Steve |
| Issue Date: | 2009 |
| Publisher: | The University of Edinburgh |
| Abstract: | The conditional independence assumption imposed by hidden Markov models
(HMMs) makes it difficult to model temporal correlation patterns in human speech.
Traditionally, this limitation is circumvented by appending the first- and second-order
regression coefficients to the observation feature vectors. Although this leads to improved
performance in recognition tasks, we argue that a straightforward use of dynamic
features in HMMs will result in an inferior model, due to the incorrect handling
of dynamic constraints. In this thesis I will show that an HMM can be transformed
into a Trajectory-HMM capable of generating smoothed output mean trajectories, by
performing a per-utterance normalisation. The resulting model can be trained by either
maximising model log-likelihood or minimising mean generation errors on the training
data. To combat the exponential growth of paths in searching, the idea of delayed path
merging is proposed and a new time-synchronous decoding algorithm built on the concept
of token-passing is designed for use in the recognition task. The Trajectory-HMM
brings a new way of sharing knowledge between speech recognition and synthesis
components, by tackling both problems in a coherent statistical framework. I evaluated
the Trajectory-HMM on two different speech tasks using the speaker-dependent
MOCHA-TIMIT database. The first used it as a generative model to recover articulatory features
from the speech signal, where the Trajectory-HMM was used in a complementary way
to the conventional HMM modelling techniques, within a joint Acoustic-Articulatory
framework. Experiments indicate that the jointly trained acoustic-articulatory models
are more accurate (having a lower Root Mean Square error) than the separately trained
ones, and that Trajectory-HMM training results in greater accuracy compared with
conventional Baum-Welch parameter updating. In addition, the Root Mean Square
(RMS) training objective proves to be consistently better than the Maximum Likelihood
objective. However, experiments on the phone recognition task show that the
MLE-trained Trajectory-HMM, while retaining the attractive properties of a proper
generative model, tends to favour over-smoothed trajectories among competing hypotheses,
and does not perform better than a conventional HMM. We use this to
build an argument that models giving a better fit on training data may suffer a reduction
of discrimination by being too faithful to the training data. Finally, experiments
using triphone models show that increasing modelling detail is an effective way to
improve modelling performance with little added complexity in training. |
| Description: | Institute for Communicating and Collaborative Systems |
| Keywords: | Informatics Institute for Communicating and Collaborative Systems Speech Technology Research |
| URI: | http://hdl.handle.net/1842/3213 |
| Appears in Collections: | Informatics thesis and dissertation collection |