<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>ERA Collection:</title>
  <link rel="alternate" href="http://hdl.handle.net/1842/3763" />
  <subtitle />
  <id>http://hdl.handle.net/1842/3763</id>
  <updated>2013-05-22T11:39:52Z</updated>
  <dc:date>2013-05-22T11:39:52Z</dc:date>
  <entry>
    <title>Unsupervised adaptation for HMM-based speech synthesis</title>
    <link rel="alternate" href="http://hdl.handle.net/1842/3841" />
    <author>
      <name>King, Simon</name>
    </author>
    <author>
      <name>Tokuda, Keiichi</name>
    </author>
    <author>
      <name>Zen, Heiga</name>
    </author>
    <author>
      <name>Yamagishi, Junichi</name>
    </author>
    <id>http://hdl.handle.net/1842/3841</id>
    <updated>2013-04-09T13:07:47Z</updated>
    <published>2008-09-01T00:00:00Z</published>
    <summary type="text">Title: Unsupervised adaptation for HMM-based speech synthesis
Authors: King, Simon; Tokuda, Keiichi; Zen, Heiga; Yamagishi, Junichi
Abstract: It is now possible to synthesise speech using HMMs with a comparable quality to unit-selection techniques. Generating speech from a model has many potential advantages over concatenating waveforms. The most exciting is model adaptation. It has been shown that supervised speaker adaptation can yield high-quality synthetic voices with an order of magnitude less data than required to train a speaker-dependent model or to build a basic unit-selection system. Such supervised methods require labelled adaptation data for the target speaker. In this paper, we introduce a method capable of unsupervised adaptation, using only speech from the target speaker without any labelling.</summary>
    <dc:date>2008-09-01T00:00:00Z</dc:date>
  </entry>
  <entry>
    <title>A Shrinkage Estimator for Speech Recognition with Full Covariance HMMs</title>
    <link rel="alternate" href="http://hdl.handle.net/1842/3839" />
    <author>
      <name>Bell, Peter</name>
    </author>
    <author>
      <name>King, Simon</name>
    </author>
    <id>http://hdl.handle.net/1842/3839</id>
    <updated>2010-10-05T12:45:15Z</updated>
    <published>2008-01-01T00:00:00Z</published>
    <summary type="text">Title: A Shrinkage Estimator for Speech Recognition with Full Covariance HMMs
Authors: Bell, Peter; King, Simon
Abstract: We consider the problem of parameter estimation in full-covariance Gaussian mixture systems for automatic speech recognition. Due to the high dimensionality of the acoustic feature vector, the standard sample covariance matrix has a high variance and is often poorly conditioned when the amount of training data is limited. We explain how the use of a shrinkage estimator can solve these problems, and derive a formula for the optimal shrinkage intensity. We present results of experiments on a phone recognition task, showing that the estimator gives a performance improvement over a standard full-covariance system.</summary>
    <dc:date>2008-01-01T00:00:00Z</dc:date>
  </entry>
  <entry>
    <title>Cross-lingual Portability of MLP-Based Tandem Features -- A Case Study for English and Hungarian</title>
    <link rel="alternate" href="http://hdl.handle.net/1842/3838" />
    <author>
      <name>Toth, Laszlo</name>
    </author>
    <author>
      <name>Frankel, Joe</name>
    </author>
    <author>
      <name>Gosztolya, Gabor</name>
    </author>
    <author>
      <name>King, Simon</name>
    </author>
    <id>http://hdl.handle.net/1842/3838</id>
    <updated>2010-10-05T12:43:37Z</updated>
    <published>2008-01-01T00:00:00Z</published>
    <summary type="text">Title: Cross-lingual Portability of MLP-Based Tandem Features -- A Case Study for English and Hungarian
Authors: Toth, Laszlo; Frankel, Joe; Gosztolya, Gabor; King, Simon
Abstract: One promising approach for building ASR systems for less-resourced languages is cross-lingual adaptation. Tandem ASR is particularly well suited to such adaptation, as it includes two cascaded modelling steps: feature extraction using multi-layer perceptrons (MLPs), followed by modelling using a standard HMM. The language-specific tuning can be performed by adjusting the HMM only, leaving the MLP untouched. Here we examine the portability of feature extractor MLPs between an Indo-European (English) and a Finno-Ugric (Hungarian) language. We present experiments which use both conventional phone-posterior and articulatory feature (AF) detector MLPs, both trained on a much larger quantity of (English) data than the monolingual (Hungarian) system. We find that the cross-lingual configurations achieve similar performance to the monolingual system, and that, interestingly, the AF detectors lead to slightly worse performance, despite the expectation that they should be more language-independent than phone-based MLPs. However, the cross-lingual system outperforms all other configurations when the English phone MLP is adapted on the Hungarian data.</summary>
    <dc:date>2008-01-01T00:00:00Z</dc:date>
  </entry>
  <entry>
    <title>A comparison of phone and grapheme-based spoken term detection</title>
    <link rel="alternate" href="http://hdl.handle.net/1842/3837" />
    <author>
      <name>Wang, Dong</name>
    </author>
    <author>
      <name>Frankel, Joe</name>
    </author>
    <author>
      <name>Tejedor, Javier</name>
    </author>
    <author>
      <name>King, Simon</name>
    </author>
    <id>http://hdl.handle.net/1842/3837</id>
    <updated>2010-10-05T12:43:27Z</updated>
    <published>2008-01-01T00:00:00Z</published>
    <summary type="text">Title: A comparison of phone and grapheme-based spoken term detection
Authors: Wang, Dong; Frankel, Joe; Tejedor, Javier; King, Simon
Abstract: We propose grapheme-based sub-word units for spoken term detection (STD). Compared to phones, graphemes have a number of potential advantages. For out-of-vocabulary search terms, phone-based approaches must generate a pronunciation using letter-to-sound rules. Using graphemes obviates this potentially error-prone hard decision, shifting pronunciation modelling into the statistical models describing the observation space. In addition, long-span grapheme language models can be trained directly from large text corpora. We present experiments on Spanish and English data, comparing phone and grapheme-based STD. For Spanish, where phone and grapheme-based systems give similar transcription word error rates (WERs), grapheme-based STD significantly outperforms a phone-based approach. The converse is found for English, where the phone-based system outperforms a grapheme approach. However, we present additional analysis which suggests that phone-based STD performance levels may be achieved by a grapheme-based approach despite lower transcription accuracy, and that the two approaches may usefully be combined. We propose a number of directions for future development of these ideas, and suggest that if grapheme-based STD can match phone-based performance, the inherent flexibility in dealing with out-of-vocabulary terms makes this a desirable approach.</summary>
    <dc:date>2008-01-01T00:00:00Z</dc:date>
  </entry>
</feed>

