Edinburgh Research Archive, The University of Edinburgh

Please use this identifier to cite or link to this item:

Files in This Item:

File: YamagishiJ_Thousands of Voices.pdf
Size: 2.56 MB
Format: Adobe PDF
Title: Thousands of Voices for HMM-Based Speech Synthesis - Analysis and Application of TTS Systems Built on Various ASR Corpora
Authors: Yamagishi, Junichi
Usabaev, Bela
King, Simon
Watts, Oliver
Dines, John
Tian, Jilei
Guan, Yong
Hu, Rile
Oura, Keiichiro
Wu, Yi-Jian
Tokuda, Keiichi
Karhila, Reima
Kurimo, Mikko
Issue Date: May-2010
Journal Title: IEEE Transactions on Audio, Speech, and Language Processing
Volume: 18
Issue: 5
Page Numbers: 984-1004
Publisher: IEEE
Abstract: In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
Keywords: hidden Markov models
speaker recognition
speech synthesis
Automatic speech recognition (ASR)
H Triple S (HTS)
SPEECON database
WSJ database
average voice
hidden Markov model (HMM)-based speech synthesis
speaker adaptation
voice conversion
ISSN: 1558-7916
Appears in Collections: Linguistics and English Language publications

Items in ERA are protected by copyright, with all rights reserved, unless otherwise indicated.


Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh 2013, and/or the original authors.