An Investigation of Nonlinear Speech Synthesis and Pitch Modification Techniques
Speech synthesis technology plays an important role in many aspects of man–machine interaction, particularly in telephony applications. To be widely accepted, synthesised speech should sound as human-like as possible. This thesis investigates novel techniques for the speech signal generation stage of a speech synthesiser, based on concepts from nonlinear dynamical theory. It focuses on natural-sounding synthesis of voiced speech, coupled with the ability to generate the sound at the required pitch. The one-dimensional voiced speech time-domain signals are embedded into an appropriate higher-dimensional space using Takens' method of delays. These reconstructed state space representations have approximately the same dynamical properties as the original speech generating system and are thus effective models. A new technique for marking epoch points in voiced speech, operating in the state space domain, is proposed. Because one revolution of the state space representation corresponds to one pitch period, pitch-synchronous points can be found using a Poincaré map. The epoch pulses are themselves pitch synchronous and can therefore be marked in this way. The same state space representation is also used in a locally-linear speech synthesiser. This models the nonlinear dynamics of the speech signal by a series of local approximations, using the original signal as a template. The synthesised speech is natural-sounding because, rather than simply copying the original data, the technique makes use of the local dynamics to create a new, unique signal trajectory. Pitch modification within this synthesis structure is also investigated, with an attempt made to exploit the Šilnikov-type orbit of voiced speech state space reconstructions. However, this technique is found to be incompatible with the locally-linear modelling technique, leaving the pitch modification issue unresolved.
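The embedding and Poincaré-section ideas described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `delay_embed` and `poincare_crossings`, the embedding dimension, the delay, the choice of section plane and the synthetic test signal are all assumptions made purely for illustration.

```python
import numpy as np

def delay_embed(x, dim=3, tau=10):
    """Method of delays: row n is [x[n], x[n+tau], ..., x[n+(dim-1)*tau]].

    dim and tau are illustrative defaults, not values taken from the thesis.
    """
    x = np.asarray(x, dtype=float)
    n_points = len(x) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("signal too short for this dim and tau")
    return np.column_stack([x[k * tau : k * tau + n_points] for k in range(dim)])

def poincare_crossings(states, normal, offset=0.0):
    """Indices where the trajectory passes through the plane
    normal . s = offset, going from the negative to the non-negative side.

    Counting crossings in one direction only gives one mark per revolution,
    provided the plane cuts the orbit once per cycle (an assumption here).
    """
    side = states @ np.asarray(normal, dtype=float) - offset
    return np.where((side[:-1] < 0) & (side[1:] >= 0))[0] + 1

# Synthetic stand-in for a voiced sound: 100 Hz fundamental plus one
# harmonic, sampled at 8 kHz (hypothetical values, not real speech data).
fs, f0 = 8000, 100
n = np.arange(800)
theta = 2 * np.pi * f0 * n / fs + 0.1
x = np.sin(theta) + 0.3 * np.sin(2 * theta)

states = delay_embed(x, dim=3, tau=10)
marks = poincare_crossings(states, normal=[1.0, 0.0, 0.0])
print(np.diff(marks))  # spacing between successive pitch-synchronous marks
```

For this synthetic signal the marks fall one pitch period (80 samples) apart. Real voiced speech would need a more carefully chosen section plane, and the abstract does not say how the thesis selects it.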
A different modelling strategy, using a radial basis function neural network to model the state space dynamics, is then considered. This produces a parametric model of the speech sound. Synthesised speech is obtained by connecting a delayed version of the network output back to the input via a global feedback loop. The network then synthesises speech in a free-running manner. Stability of the output is ensured by using regularisation theory when learning the weights. Complexity is also kept to a minimum because the network centres are fixed on a data-independent hyper-lattice, so only the linear-in-the-parameters weights need to be learnt for each vowel realisation. Pitch modification is again investigated, based on the idea of interpolating the weight vector between different realisations of the same vowel at differing pitch values. However, modelling the inter-pitch weight vector variations proves very difficult, indicating that further study of pitch modification techniques is required before a complete nonlinear synthesiser can be implemented.
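The radial basis function approach can be sketched in a few lines under stated assumptions. Gaussian basis functions, a plain ridge penalty standing in for the regularisation step, and every name and parameter value below (`lattice_centres`, `fit_weights`, `free_run`, the width, `lam`) are hypothetical choices for illustration. The abstract does not specify these details.

```python
import itertools
import numpy as np

def make_training_set(x, dim, tau):
    """Delay vectors [x[i], ..., x[i+(dim-1)*tau]] and next-sample targets."""
    span = (dim - 1) * tau
    n = len(x) - span - 1
    states = np.column_stack([x[k * tau : k * tau + n] for k in range(dim)])
    return states, x[span + 1 : span + 1 + n]

def lattice_centres(dim, points_per_axis, lo=-1.0, hi=1.0):
    """Centres on a fixed, data-independent grid in the embedding space."""
    axis = np.linspace(lo, hi, points_per_axis)
    return np.array(list(itertools.product(axis, repeat=dim)))

def rbf_design(states, centres, width):
    """Gaussian basis activations, one row per state, one column per centre."""
    d2 = ((states[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_weights(states, targets, centres, width, lam):
    """Ridge-regularised least squares; only the linear weights are learnt."""
    phi = rbf_design(states, centres, width)
    return np.linalg.solve(phi.T @ phi + lam * np.eye(phi.shape[1]),
                           phi.T @ targets)

def free_run(history, weights, centres, width, dim, tau, n_samples):
    """Feed each output back into the delay line to synthesise new samples."""
    buf = list(history)  # needs at least (dim-1)*tau + 1 seed samples
    for _ in range(n_samples):
        state = np.array([buf[-1 - k * tau] for k in range(dim - 1, -1, -1)])
        buf.append(float(rbf_design(state[None, :], centres, width)[0] @ weights))
    return np.array(buf[len(history):])

# Synthetic stand-in for one vowel realisation (not real speech data).
n = np.arange(800)
theta = 2 * np.pi * 100 * n / 8000 + 0.1
x = np.sin(theta) + 0.3 * np.sin(2 * theta)
x = x / np.max(np.abs(x))  # scale into the lattice range [-1, 1]

dim, tau, width = 3, 5, 0.5
states, targets = make_training_set(x, dim, tau)
centres = lattice_centres(dim, points_per_axis=5)
weights = fit_weights(states, targets, centres, width, lam=1e-3)
synth = free_run(x[: (dim - 1) * tau + 1], weights, centres, width, dim, tau, 400)
```

Because the centres stay fixed, each realisation is summarised by its weight vector alone, which is what makes interpolating weights between pitch values a natural thing to try. In this sketch the Gaussian basis keeps the free-running output bounded, but that alone does not guarantee a sustained, speech-like oscillation.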