<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>ERA Collection:</title>
    <link>http://hdl.handle.net/1842/917</link>
    <description />
    <pubDate>Sun, 19 May 2013 01:46:10 GMT</pubDate>
    <dc:date>2013-05-19T01:46:10Z</dc:date>
    <image>
      <title>ERA Collection:</title>
      <url>http://www.era.lib.ed.ac.uk:80/retrieve/2237/cstr3.png</url>
      <link>http://hdl.handle.net/1842/917</link>
    </image>
    <item>
      <title>Synthesis of fast speech with interpolation of adapted HSMMs and its evaluation by blind and sighted listeners</title>
      <link>http://hdl.handle.net/1842/4542</link>
      <description>Title: Synthesis of fast speech with interpolation of adapted HSMMs and its evaluation by blind and sighted listeners
Authors: Pucher, Michael; Schabus, Dietmar; Yamagishi, Junichi
Abstract: In this paper we evaluate a method for generating synthetic speech at high speaking rates based on the interpolation of hidden semi-Markov models (HSMMs) trained on speech data recorded at normal and fast speaking rates. The subjective evaluation was carried out with both blind listeners, who are used to very fast speaking rates, and sighted listeners. We show that we can achieve a better intelligibility rate and higher voice quality with this method compared to standard HSMM-based duration modeling. We also evaluate duration modeling with the interpolation of all the acoustic features including not only duration but also spectral and F0 models. An analysis of the mean squared error (MSE) of standard HSMM-based duration modeling for fast speech identifies problematic linguistic contexts for duration modeling.</description>
      <pubDate>Fri, 01 Jan 2010 00:00:00 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/1842/4542</guid>
      <dc:date>2010-01-01T00:00:00Z</dc:date>
    </item>
    <item>
      <title>Out-of-vocabulary spoken term detection</title>
      <link>http://hdl.handle.net/1842/4087</link>
      <description>Title: Out-of-vocabulary spoken term detection
Authors: Wang, Dong
Abstract: Spoken term detection (STD) is a fundamental task for multimedia information retrieval. A major challenge faced by an STD system is the serious performance reduction when detecting out-of-vocabulary (OOV) terms. The difficulties arise not only from the absence of pronunciations for such terms in the system dictionaries, but also from intrinsic uncertainty in pronunciations, significant diversity in term properties and a high degree of weakness in acoustic and language modelling.
To tackle the OOV issue, we first applied the joint-multigram model to predict pronunciations for OOV terms in a stochastic way. Based on this, we propose a stochastic pronunciation model that considers all possible pronunciations for OOV terms so that the high pronunciation uncertainty is compensated for.
Furthermore, to deal with the diversity in term properties, we propose a term-dependent discriminative decision strategy, which employs discriminative models to integrate multiple informative factors and confidence measures into a classification probability, which gives rise to minimum decision cost.
In addition, to address the weakness in acoustic and language modelling, we propose a direct posterior confidence measure which replaces the generative models with a discriminative model, such as a multi-layer perceptron (MLP), to obtain a robust confidence for OOV term detection.
With these novel techniques, the STD performance on OOV terms was improved substantially and significantly in our experiments on meeting speech data.</description>
      <pubDate>Fri, 01 Jan 2010 00:00:00 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/1842/4087</guid>
      <dc:date>2010-01-01T00:00:00Z</dc:date>
    </item>
    <item>
      <title>Speaker normalisation for large vocabulary multiparty conversational speech recognition</title>
      <link>http://hdl.handle.net/1842/3983</link>
      <description>Title: Speaker normalisation for large vocabulary multiparty conversational speech recognition
Authors: Garau, Giulia
Abstract: One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noises etc.) and to the variability of speech across different speakers (i.e. due to different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims at normalising the acoustic signal, making it independent of the vocal tract length. This is done by a speaker-specific warping of the frequency axis parameterised through a warping factor. In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task showing a great variability of the speech acoustics both across different speakers and across time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using Maximum Likelihood seem to be context dependent: appearing to be influenced by the current conversational partner and being correlated with the behaviour of formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords and thus the pitch. In this thesis we also investigated pitch-adaptive acoustic features with the goal of further improving the speaker normalisation provided by VTLN.
We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level using heteroscedastic linear discriminant analysis (HLDA), and at the system level using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription. We found that VTLN benefits the most from pitch-adaptive features. Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in the word error rate across all the tasks. Combining at the system level using ROVER resulted in a further significant improvement. Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. It was found that pitch-adaptive spectral analysis, providing a representation which is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with an improved speaker independence. Furthermore, this has also been shown to be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving the overall recognition performance.</description>
      <pubDate>Thu, 01 Jan 2009 00:00:00 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/1842/3983</guid>
      <dc:date>2009-01-01T00:00:00Z</dc:date>
    </item>
    <item>
      <title>Identification of Contrast and Its Emphatic Realization in HMM-based Speech Synthesis</title>
      <link>http://hdl.handle.net/1842/3963</link>
      <description>Title: Identification of Contrast and Its Emphatic Realization in HMM-based Speech Synthesis
Authors: Badino, Leonardo; Andersson, J. Sebastian; Yamagishi, Junichi; Clark, Robert A J
Abstract: The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and prosodically signal it with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM-based speech synthesis system that attempts to properly control prosodic prominence (including emphasis). Results from a large scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.</description>
      <pubDate>Thu, 01 Jan 2009 00:00:00 GMT</pubDate>
      <guid isPermaLink="false">http://hdl.handle.net/1842/3963</guid>
      <dc:date>2009-01-01T00:00:00Z</dc:date>
    </item>
  </channel>
</rss>

