Polyglot voice design for unit selection speech synthesis
Item statusRestricted Access
MetadataShow full item record
Current text-to-speech (TTS) systems are increasingly faced with mixed language textual input. Most TTS systems are designed to allow building synthetic voices for different languages, but each voice is able to ”speak” only one language at a time. In order to synthesize mixed language input, polyglot voices are needed which are able to switch between languages when it is required by textual input. A polyglot voice will typically have one basic language and additionally the ability to synthesize foreign words when these are encountered in the textual input. Design of polyglot voices for unit selection speech synthesis is still a research question. An inherent problem of unit selection speech synthesis is that the synthesis quality is closely related to the contents of the unit database. Concatenation of units not in the database usually results in bad synthesis quality. At the same time, building the database with good coverage of units results in a prohibitively large database if the intended domain of synthesized text is unlimited. Polyglot databases have an additional problem that not only single language units have to be stored in the database, but also the concatenation points of words from foreign languages have to be accounted for. This exceeds the database size even more, so that it is worth exploring whether database size can be reduced by including only single language units in the database and handling multilingual units on synthesis time. The present work is concerned with database design for a polyglot unit selection voice. It’s main aim is to examine whether alternative methods for handling multilingual cross-word diphones result in same or better synthesis quality than including these diphones in the database. Three alternative approaches are suggested and model polyglot voices are built to test these methods. The languages included in the synthesizer are Bosnian, English and German. The output quality of the synthesized multilingual word boundary is tested on Bosnian-English and Bosnian-German word pairs in a perceptual experiment.