PROMO - Prosody and Modification of Speech

Rozinaj Gregor, APVV LPP-078-06, (2007-2010)

 

Abstract

 

The project deals with speech processing and synthesis. Its goal is to improve the methods used in Slovak TTS (text-to-speech) synthesis based on concatenation of acoustic units, to improve the methods for modifying original speech recordings, and to lay a solid foundation for the Slovak speech synthesis system being developed at the Department of Telecommunications at FEI STU. One of the important current trends in speech synthesis is to produce greater variability in the synthesized speech. A main source of this variability is the prosody of speech (intonation, rhythm and stress), together with the emotional state of the speaker. Most earlier work in this field was aimed at the prosodic aspects of speech and at experiments with rule-based formant synthesis. Growing demands on the intelligibility of synthetic speech then led research to turn to its naturalness. Much of today's research deals with imitating the variety of human speech and voice, including emotions. This project aims to follow these trends and to advance the field of Slovak speech synthesis.

In TTS systems, the primary objective is to convey textual information in spoken form in a way that sounds familiar to the human ear. A TTS system consists of two main blocks [1]: NLP (Natural Language Processing) and DSP (Digital Signal Processing). The NLP block processes the input text and transforms it into data from which speech can be generated. These data contain the sequence of phonemes and information about the prosody. The DSP block generates the speech from these data and, at a higher level of synthesis, brings the synthetic speech closer to natural speech by modeling the prosody and emotions of speech.
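The division of labour between the two blocks can be illustrated with a toy sketch. Everything here is a placeholder chosen for illustration: the `PhonemeTarget` record, the one-letter-per-phoneme rule in `nlp_block` and the sine-tone rendering in `dsp_block` are assumptions, not parts of the system described in this project.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeTarget:
    """Hypothetical record passed from the NLP block to the DSP block."""
    symbol: str        # phoneme label
    duration_s: float  # target duration in seconds
    f0_hz: float       # target fundamental frequency
    gain: float        # target relative intensity


def nlp_block(text: str) -> List[PhonemeTarget]:
    # Placeholder: treats every letter as one phoneme with flat prosody.
    return [PhonemeTarget(ch, 0.08, 120.0, 1.0)
            for ch in text.lower() if ch.isalpha()]


def dsp_block(targets: List[PhonemeTarget], sample_rate: int = 16000) -> List[float]:
    # Placeholder: renders each target as a sine tone at its F0.
    samples: List[float] = []
    phase = 0.0
    for target in targets:
        n_samples = int(round(target.duration_s * sample_rate))
        step = 2.0 * math.pi * target.f0_hz / sample_rate
        for _ in range(n_samples):
            samples.append(target.gain * math.sin(phase))
            phase += step
    return samples


targets = nlp_block("Dobry den")
signal = dsp_block(targets)
```

The point of the sketch is only the interface: the NLP stage emits a phoneme sequence annotated with prosodic targets, and the DSP stage consumes nothing else.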

By prosody we mean features of the speech signal such as audible changes in the fundamental frequency, in the intensity of the voice and in the length of the syllables [1]. Since prosody is time-aligned with syllables or groups of syllables, prosodic features are also called suprasegmental. Research on prosody is a complicated task: no basic unit of prosody has been established so far, and because of this the outcomes of the research are usually presented only as global descriptions. The basic dynamic means of speech are the stress, the melody of the speech and the speech rate. The three features of prosody follow from these. In a simplified way, the intensity of speech corresponds to the stress, the change of fundamental frequency corresponds to the intonation, and the duration of speech segments corresponds to the speech rate.
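As a rough illustration of the three acoustic correlates, the following sketch measures intensity, fundamental frequency and duration on a single frame. The autocorrelation pitch estimator, the 60 to 400 Hz search range and the `voicing` threshold are assumptions made for this example; the source does not prescribe a particular analysis method.

```python
import math


def rms_intensity(frame):
    """Root-mean-square level of the frame, a simple correlate of stress."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0, voicing=0.3):
    """Return F0 in Hz from the autocorrelation peak, or 0.0 if unvoiced."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation; it slightly favours shorter lags.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag == 0 or best_score < voicing * energy:
        return 0.0  # assumed voicing threshold not reached
    return sample_rate / best_lag


def duration_s(n_samples, sample_rate):
    """Segment duration in seconds, the correlate of speech rate."""
    return n_samples / sample_rate


# Synthetic test frame: 40 ms of a 125 Hz sine with amplitude 0.5.
sample_rate = 16000
frame = [0.5 * math.sin(2.0 * math.pi * 125.0 * n / sample_rate) for n in range(640)]
f0 = estimate_f0(frame, sample_rate)
level = rms_intensity(frame)
length = duration_s(len(frame), sample_rate)
```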

The two main categories of intonation models are phonological and acoustic-phonological models. Phonological models represent the prosody of an utterance as a sequence of abstract units that are elements of the corpus annotation. Acoustic-phonological models interpret the shape of the fundamental frequency curve as a superposition (overlay) of several components. In the project, we propose to analyse the possibilities of the acoustic-phonological models, since they are related to the chosen approach to prosody modeling, and we intend to implement the chosen acoustic-phonological model in the system.
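The idea of superposition can be sketched with a Fujisaki-style formulation, in which a slow phrase component and fast accent components are added to a base value in the log-frequency domain. The choice of this particular model is an assumption made for illustration, since the text leaves the model open, and the parameter values (`alpha`, `beta`, `gamma`, the command times and amplitudes) are arbitrary.

```python
import math


def phrase_component(t, alpha=2.0):
    """Response to a phrase command: slow rise and decay over the phrase."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_component(t, beta=20.0, gamma=0.9):
    """Response to an accent onset: fast local rise, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_hz, phrases, accents):
    """Superpose the components in the log domain.

    phrases: list of (onset_s, amplitude)
    accents: list of (onset_s, offset_s, amplitude)
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_hz)
        for onset, amplitude in phrases:
            log_f0 += amplitude * phrase_component(t - onset)
        for onset, offset, amplitude in accents:
            log_f0 += amplitude * (accent_component(t - onset)
                                   - accent_component(t - offset))
        contour.append(math.exp(log_f0))
    return contour


# One phrase command at 0 s and one accent between 0.5 s and 0.8 s.
times = [i * 0.01 for i in range(200)]
contour = f0_contour(times, 100.0, [(0.0, 0.5)], [(0.5, 0.8, 0.4)])
flat = f0_contour(times, 100.0, [], [])
```

With no commands the contour stays at the base frequency; each command adds its own contribution, so the components can be analysed and modified separately.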

In speech synthesis based on concatenation of acoustic units [2], the speech signals can be coded with various speech models. These models are expected to allow the acoustic units to be joined smoothly at their borders. Discontinuities in the prosody (for example in pitch period or energy) would make the speech sound unnatural.
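A minimal sketch of how such a discontinuity could be removed, assuming each unit is described by a per-frame parameter track (for example F0 or energy). The function `smooth_join` and its linear taper over `span` frames are hypothetical choices for this example, not a method taken from the cited work.

```python
def smooth_join(left, right, span=3):
    """Concatenate two non-empty parameter tracks and ramp out the border jump.

    Half of the jump is added to the end of the left track and half is
    subtracted from the start of the right track; the correction fades
    linearly to zero over `span` frames on each side.
    """
    left, right = list(left), list(right)
    jump = right[0] - left[-1]
    n_left = min(span, len(left))
    n_right = min(span, len(right))
    for d in range(n_left):
        weight = (n_left - d) / n_left  # 1.0 at the border frame
        left[len(left) - 1 - d] += 0.5 * jump * weight
    for d in range(n_right):
        weight = (n_right - d) / n_right
        right[d] -= 0.5 * jump * weight
    return left + right


def max_step(track):
    """Largest frame-to-frame change, a crude measure of discontinuity."""
    return max(abs(b - a) for a, b in zip(track, track[1:]))


# Two units whose F0 tracks differ by 20 Hz at the border.
f0_left = [100.0, 100.0, 100.0, 100.0]
f0_right = [120.0, 120.0, 120.0, 120.0]
raw = f0_left + f0_right
joined = smooth_join(f0_left, f0_right, span=2)
```

The same function can be applied to the energy track, so that both pitch and loudness pass through the border without an audible step.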

Sinusoidal models are very suitable for prosodic manipulation. They are parametric models, so it is easy to change the quantities directly related to the prosodic features: the intonation (movement of the fundamental frequency), the speech rate (duration) and the stress (intensity). In synthesis based on acoustic units, sinusoidal models are mostly used as the final blocks that smoothly concatenate the acoustic units and adjust the prosody.
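Because the model is parametric, each prosodic feature maps onto one parameter. The sketch below assumes a simplified harmonic representation in which each frame holds an F0 value and a list of harmonic amplitudes; the function `synthesize` and its `pitch_scale`, `time_scale` and `gain` arguments are hypothetical names introduced for this example.

```python
import math


def synthesize(frames, sample_rate, frame_s,
               pitch_scale=1.0, time_scale=1.0, gain=1.0):
    """Render harmonic frames with modified pitch, duration and intensity.

    frames: list of (f0_hz, [amplitude of harmonic 1, 2, ...])
    """
    out = []
    phases = []  # running phase of each harmonic, kept across frames
    n_per_frame = int(round(frame_s * time_scale * sample_rate))
    nyquist = sample_rate / 2.0
    for f0_hz, amplitudes in frames:
        f0_hz *= pitch_scale  # intonation change
        while len(phases) < len(amplitudes):
            phases.append(0.0)
        for _ in range(n_per_frame):  # duration change
            sample = 0.0
            for k, amplitude in enumerate(amplitudes):
                freq = (k + 1) * f0_hz
                if freq < nyquist:  # skip harmonics that would alias
                    sample += amplitude * math.sin(phases[k])
                phases[k] += 2.0 * math.pi * freq / sample_rate
            out.append(gain * sample)  # intensity change
    return out


frames = [(120.0, [0.6, 0.3, 0.1])] * 10
base = synthesize(frames, 16000, 0.01)
slow = synthesize(frames, 16000, 0.01, time_scale=1.5)
loud = synthesize(frames, 16000, 0.01, gain=2.0)
high = synthesize(frames, 16000, 0.01, pitch_scale=2.0)
```

This sketch keeps the amplitude of each harmonic index fixed when the pitch changes, which also shifts the spectral envelope; a practical system would re-sample the envelope at the new harmonic frequencies.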

The HNM (harmonic plus noise) model [3], [4], which belongs to the family of sinusoidal models, was proposed specifically for speech processing. This model assumes that the speech signal consists of a harmonic part and a noise part. The quasi-periodic components of the speech signal are treated as the harmonic part and the non-periodic components as the noise part. The two types of components are separated in the frequency domain by a time-varying parameter called the maximum voiced frequency Fm. The lower band (below Fm) is represented by harmonic sinusoids and the upper band by modulated noise components. Although these assumptions are not completely valid in terms of speech production, they are useful in terms of speech perception: they lead to a simple model of speech that provides high-quality synthesis and modification of the speech signal. Sinusoidal models, and the HNM model in particular, are very suitable for speech modification. They make it possible to change prosodic features such as the voice pitch, the speech rate and the stress, which also play a significant role in emotional speech. It is therefore natural to use this model for changing the emotions of the speaker. Modifications of this kind were made in the work described in [5].
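The band split at Fm can be sketched as follows. This is a deliberately crude illustration under several assumptions: the harmonic amplitudes follow an arbitrary 1/k roll-off, the noise band is obtained with a first-order high-pass filter rather than the filtering used in [3], and the time-domain modulation of the noise is omitted. The function names are hypothetical.

```python
import math
import random


def harmonic_count(f0_hz, fm_hz):
    """Number of harmonics of f0 that fit below the maximum voiced frequency."""
    return int(fm_hz // f0_hz)


def hnm_frame(f0_hz, fm_hz, harmonic_amp, noise_amp, n_samples, sample_rate, seed=0):
    """Synthesize one frame as harmonics below Fm plus noise above Fm."""
    rng = random.Random(seed)
    k_max = harmonic_count(f0_hz, fm_hz) if f0_hz > 0.0 else 0
    # One-pole low-pass coefficient; white noise minus its low-passed
    # version gives a rough high-pass with cutoff near Fm.
    a = 1.0 - math.exp(-2.0 * math.pi * fm_hz / sample_rate)
    low = 0.0
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        harmonic = sum(harmonic_amp / k * math.sin(2.0 * math.pi * k * f0_hz * t)
                       for k in range(1, k_max + 1))
        white = rng.gauss(0.0, 1.0)
        low += a * (white - low)
        noise = white - low
        out.append(harmonic + noise_amp * noise)
    return out


# Voiced frame: harmonics up to Fm = 4 kHz, weak noise above it.
voiced = hnm_frame(120.0, 4000.0, 0.5, 0.05, 320, 16000)
# Unvoiced frame: Fm = 0, so the whole band is noise.
unvoiced = hnm_frame(0.0, 0.0, 0.0, 0.1, 320, 16000)
```

Raising or lowering Fm from frame to frame moves the boundary between the two parts, which is how the model follows the changing degree of voicing in speech.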

At present, there is no Slovak speech synthesizer that fulfills the demanding requirements of natural-sounding speech synthesis. The best results are achieved by TTS synthesizers based on concatenation of acoustic units. Their quality, however, depends most of all on the quality of the corpus database and on the quality of the concatenation of the acoustic units. At the Department of Telecommunications, a corpus synthesizer is being developed that provides good conditions for high-quality speech synthesis, given a high-quality corpus database and good post-processing in the form of prosody and emotion modeling of the synthesized speech. Modeling of speaker emotions is very closely related to modeling of the prosodic features of speech and is therefore very language-specific. Systems that would model the prosody of Slovak speech with emotions do not exist so far.

The goal of our research and development effort is to create a basis for high-quality synthesis of Slovak speech based on concatenation of acoustic units, which depends to a large extent on the quality of the corpus databases, and to create a post-processing block for modifying the prosody and the emotions of the speaker. This should make it possible to substantially increase the intelligibility and naturalness of the synthetic speech. Our approach is based on HNM modeling of speech, so the block can be used not only for speech synthesis but also for modification of pre-recorded speech. We plan to build a specialized laboratory at the Department of Telecommunications for professional-quality speech recording, where domain-oriented corpora for the corpus synthesizer will be recorded.

The first main goal comprises research and subsequent development of systems, based on HNM (harmonic plus noise) modeling of speech, that will be able to modify the prosodic features of speech and the emotions of the speaker. Development of these systems involves the analysis, selection and implementation of an acoustic-phonological model. We also intend to analyse the possibilities of speaker voice conversion. Next, we plan to build a system for creating corpus databases for Slovak speech synthesis that consist of sinusoidal parameters instead of classical acoustic units in PCM format. The sinusoidal parameters should allow fast and easy concatenation of the units and modification of the speech signal for the purposes of modeling the prosody and speaker emotions.

The next goal is to build and operate a laboratory for professional-quality recording of the speech databases needed for high-quality TTS synthesis. The laboratory will also be used for teaching at the Department of Telecommunications, in the subject Digital speech processing and other related subjects.

The project is also planned to be closely connected with other projects running at the Department of Telecommunications, specifically VEGA 1/3110/06 (Non-linear processing of multimedia and biomedicine signals in telecommunications), VEGA 1/3094/06 (Algorithms and methods of digital signal processing, processes of control of converged networks platform and NGN and new generation of multimedia services and applications) and VTP 1003/2003 (Virtual reality of multimedia speech synthesis).

 

Fig. 1. Schedule of the project

Time axis: 2007-2008, 2008-2009 and 2009-2010, in monthly columns (months 3 to 12, 1 to 12, 1 to 12, and 1 to 2).

Scheduled tasks:

Building of the laboratory

Performance of the laboratory

Recording of the classical domain oriented corpuses

Development of the system for sinusoidal corpuses creation

Creation of sinusoidal corpuses

Development of the system for prosodic modification of the speech

Analysis and implementation of acoustic-phonological model of intonation

Development of the system for emotions creation

Analysis of the possibilities of voice conversion

References:

 

[1] Dutoit, T.: An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, 1997.

[2] Furui, S. et al.: "Speech-to-Text and Text-to-Speech Summarization of Spontaneous Speech", IEEE Trans. on Speech and Audio Processing, vol. 12, no. 4, pp. 401-408, January 2004.

[3] Stylianou, Y.: "Applying the Harmonic plus Noise Model in Concatenative Speech Synthesis", IEEE Trans. on Speech and Audio Processing, vol. 9, no. 1, pp. 21-29, January 2001.

[4] Stylianou, Y.: "Applying the Harmonic plus Noise Model in Concatenative Speech Synthesis", IEEE Trans. Acoustics, Speech and Signal Processing, vol. 9, no. 1, pp. 21-29, January 2001.

[5] Stylianou, Y., Laroche, J., Moulines, E.: "High-Quality Speech Modification based on Harmonic + Noise Model", Proc. EUROSPEECH, pp. 451-454, 1995.