PROMO - Prosody and Modification of Speech
Rozinaj Gregor, APVV LPP-078-06,
(2007-2010)
Abstract
The proposed project deals with speech processing and synthesis. Its goal is to improve existing methods used in text-to-speech (TTS) synthesis of Slovak speech based on concatenation of acoustic units, to improve methods for modifying original speech recordings, and to build a solid basis for the Slovak speech synthesis system being developed at the Department of Telecommunications at FEI STU. One of the important current trends in speech synthesis is to produce greater variability in the synthesized speech. A main source of this variability is the prosody of speech (intonation, rhythm and stress), together with the emotional state of the speaker. Most previous work in this field focused on prosodic aspects of speech and on experiments in rule-based formant synthesis. Growing requirements for better intelligibility of synthetic speech then led research toward the naturalness of synthetic speech. Much of today's research deals with imitating the variety of human speech and voice, including emotions. The goal of this project is to follow these trends and to advance the field of Slovak speech synthesis.
In TTS systems, the primary objective is to convey text information in audible form in a way that sounds familiar to the human ear. A TTS system consists [1] of two main blocks: NLP (Natural Language Processing) and DSP (Digital Signal Processing). The NLP block processes the input text and transforms it into data from which speech can be generated. These data contain the sequence of phonemes and information about the prosody. The DSP block generates the speech from these data and, at a higher level of synthesis, brings the synthetic speech closer to natural speech by modeling its prosody and emotions.
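The two-block structure can be pictured as a small data flow. The following Python sketch is illustrative only: the names `PhonemeSpec`, `nlp_block` and `dsp_block` are hypothetical, the letter-to-phoneme mapping is a trivial placeholder, and the DSP block renders each phoneme as a plain sine tone instead of concatenating acoustic units.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16000  # Hz, assumed for illustration


@dataclass
class PhonemeSpec:
    """Output of the NLP block: one phoneme plus its prosodic targets."""
    symbol: str
    duration_s: float  # segment duration (speech rate)
    f0_hz: float       # fundamental frequency target (intonation)
    gain: float        # amplitude target (stress)


def nlp_block(text):
    """Placeholder NLP block: one 'phoneme' per letter, with an F0
    that falls by 20 % from the first to the last phoneme."""
    letters = [c for c in text.lower() if c.isalpha()]
    specs = []
    for i, c in enumerate(letters):
        fall = 1.0 - 0.2 * i / max(len(letters) - 1, 1)
        specs.append(PhonemeSpec(c, 0.08, 120.0 * fall, 0.5))
    return specs


def dsp_block(specs):
    """Placeholder DSP block: renders each phoneme as a sine tone,
    keeping the phase continuous across phoneme boundaries."""
    samples = []
    phase = 0.0
    for spec in specs:
        n = int(round(spec.duration_s * SAMPLE_RATE))
        step = 2.0 * math.pi * spec.f0_hz / SAMPLE_RATE
        for _ in range(n):
            samples.append(spec.gain * math.sin(phase))
            phase += step
    return samples


specs = nlp_block("dobry den")
signal = dsp_block(specs)
```

The point of the sketch is the interface: the NLP block hands over phonemes together with prosodic targets, and the DSP block is the only place where samples are produced.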
By prosody we mean features of the speech signal such as audible changes in the fundamental frequency, in the intensity of the voice and in the length of the syllables [1]. Since prosody is time-aligned with syllables or groups of syllables, prosodic features are also called suprasegmental.
Research on prosody is a complicated task because no basic unit of prosody has yet been established, and for this reason the outcomes of this research are usually presented only in global terms. The basic dynamic tools of speech are the stress, the melody and the rate of speech. Their definition leads to the three features of prosody. In a simplified way, the intensity of the speech accounts for the stress, the change of the fundamental frequency accounts for the intonation, and the duration of the speech segments accounts for the speech rate.
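The simplified mapping above (intensity for stress, fundamental frequency for intonation, duration for speech rate) can be illustrated by measuring the three quantities on a short segment. This is a minimal sketch under stated assumptions: the function names are hypothetical, the pitch range of 60 to 400 Hz is an assumed default, and the autocorrelation peak picker is a textbook method chosen for brevity, not the method used in the project.

```python
import math


def rms_intensity(frame):
    """Root-mean-square amplitude, a rough correlate of stress."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Fundamental frequency from the strongest (unnormalized)
    autocorrelation value within the assumed pitch range."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag]
                    for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def prosodic_features(segment, sample_rate):
    """Collect the three simplified prosodic correlates of a segment."""
    return {
        "intensity_rms": rms_intensity(segment),
        "f0_hz": estimate_f0(segment, sample_rate),
        "duration_s": len(segment) / sample_rate,
    }


# Synthetic test segment: a 200 Hz sine, 50 ms long, amplitude 0.3.
sample_rate = 8000
segment = [0.3 * math.sin(2 * math.pi * 200.0 * i / sample_rate)
           for i in range(400)]
features = prosodic_features(segment, sample_rate)
```

On real speech the F0 estimate would need voicing detection and smoothing across frames; the sketch only shows which signal quantity stands behind each prosodic feature.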
The two main categories of intonation models are phonological and acoustic-phonological models. Phonological models represent the prosody of an utterance as a sequence of abstract units that are elements of the corpus annotation. Acoustic-phonological models interpret the shape of the fundamental frequency curve as a superposition (overlay) of several components. In the project, we propose to analyse the possibilities of acoustic-phonological models, since they are relevant to the chosen prosody modeling approach, and we intend to implement the selected acoustic-phonological model in the system.
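The idea of a fundamental frequency curve built as a superposition of components can be sketched as follows. The concrete components are assumptions made for illustration only: a linearly falling phrase component plus Gaussian-shaped accent bumps. The project text does not name a specific model, and the function `f0_contour` and its parameters are hypothetical.

```python
import math


def f0_contour(times, base_hz, declination_hz_per_s, accents):
    """Sum a slowly falling phrase component and local accent components.

    accents is a list of (center_s, width_s, height_hz) tuples, each
    describing one Gaussian-shaped bump (an assumed shape).
    """
    contour = []
    for t in times:
        phrase = base_hz - declination_hz_per_s * t
        accent = sum(h * math.exp(-0.5 * ((t - c) / w) ** 2)
                     for c, w, h in accents)
        contour.append(phrase + accent)
    return contour


# A 2 s utterance sampled every 10 ms with two accents.
times = [i * 0.01 for i in range(200)]
contour = f0_contour(times, base_hz=120.0, declination_hz_per_s=10.0,
                     accents=[(0.5, 0.08, 30.0), (1.4, 0.10, 15.0)])
```

Because the components are simply added, each one can be analysed or modified separately, which is what makes this family of models attractive for prosody modification.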
In speech synthesis based on concatenation of acoustic units [2], the speech signals can be coded with various speech models. These models are expected to allow smooth joining of acoustic units at the unit boundaries. Discontinuities in the prosody (for example in pitch period or energy) would cause unnatural-sounding speech.
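One simple way to remove such a discontinuity is to spread the jump in a parameter track (for example F0 or energy per frame) linearly over a few frames on both sides of the join. The sketch below assumes non-empty per-frame parameter lists; the function name `smooth_join` and the `span` parameter are hypothetical, and the text does not prescribe this particular scheme.

```python
def smooth_join(left, right, span=3):
    """Distribute the jump between left[-1] and right[0] linearly
    over up to `span` frames on each side of the unit boundary."""
    left, right = list(left), list(right)
    span = min(span, len(left), len(right))
    half_gap = (right[0] - left[-1]) / 2.0
    for d in range(span):
        # The correction is largest at the boundary and fades with distance.
        weight = (span - d) / span
        left[-1 - d] += half_gap * weight
        right[d] -= half_gap * weight
    return left, right


# Two units whose F0 tracks differ by 20 Hz at the boundary.
left_f0, right_f0 = smooth_join([100.0] * 4, [120.0] * 4, span=2)
```

In this example the original 20 Hz step at the boundary is replaced by steps of at most 5 Hz between neighbouring frames.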
Sinusoidal models are very suitable for prosodic manipulation. They are parametric models, and it is easy to change the quantities that are directly related to the prosodic features, such as the intonation (movement of the fundamental frequency), the speech rate (duration) and the stress (intensity). In synthesis based on acoustic units, sinusoidal models are mostly used as the final blocks that allow smooth concatenation of the acoustic units and adjust the prosody.
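Why a parametric representation makes prosodic manipulation easy can be seen in a minimal sketch: pitch, duration and intensity are changed by scaling numbers before any samples are generated. The class `SinusoidalFrame` and the functions `modify` and `synthesize` are hypothetical names, and the sketch deliberately keeps the amplitude of each harmonic index unchanged when the pitch moves, whereas a practical system would resample the spectral envelope at the new harmonic frequencies.

```python
import math
from dataclasses import dataclass


@dataclass
class SinusoidalFrame:
    f0_hz: float       # fundamental frequency
    amplitudes: list   # one amplitude per harmonic (1st, 2nd, ...)
    duration_s: float  # length of the stationary segment


def modify(frame, pitch_factor=1.0, time_factor=1.0, gain=1.0):
    """Change intonation, speech rate and stress by scaling parameters."""
    return SinusoidalFrame(
        f0_hz=frame.f0_hz * pitch_factor,
        amplitudes=[a * gain for a in frame.amplitudes],
        duration_s=frame.duration_s * time_factor,
    )


def synthesize(frame, sample_rate=16000):
    """Render the frame as a sum of harmonically related sinusoids."""
    n = int(round(frame.duration_s * sample_rate))
    out = []
    for i in range(n):
        t = i / sample_rate
        out.append(sum(a * math.sin(2 * math.pi * (k + 1) * frame.f0_hz * t)
                       for k, a in enumerate(frame.amplitudes)))
    return out


original = SinusoidalFrame(f0_hz=100.0, amplitudes=[0.5, 0.25, 0.125],
                           duration_s=0.02)
# Raise the pitch by 50 %, double the duration, halve the intensity.
modified = modify(original, pitch_factor=1.5, time_factor=2.0, gain=0.5)
samples = synthesize(modified)
```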
The HNM (harmonic plus noise) model [3], [4], which belongs to the family of sinusoidal models, was proposed especially for speech processing. This model assumes that the speech signal consists of a harmonic part and a noise part. The quasi-periodic components of the speech signal are considered the harmonic part and the non-periodic components are considered the noise. These two types of components are separated in the frequency domain by a time-varying parameter called the maximum voiced frequency Fm. The lower band (below Fm) is represented by harmonic sinusoids and the upper band by modulated noise components. Although these assumptions are not completely valid in terms of speech production, they are useful in terms of speech perception: they lead to a simple model of speech that provides high-quality synthesis and modification of the speech signal. Sinusoidal models, and the HNM model in particular, are very suitable for speech modification; they make it possible to change prosodic features such as the voice pitch, the speech rate and the stress, which also play a significant role in emotional speech. It is therefore natural to use this model for changing the emotions of the speaker. Such modifications were carried out in the work described in [5].
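The band split at the maximum voiced frequency Fm can be illustrated with a single synthetic frame. This is a strongly simplified sketch, not the HNM implementation from [3] or [5]: the harmonic amplitudes fall off as 1/k (an assumed spectral tilt), and the noise part is approximated by random-phase sinusoids placed on the DFT grid above Fm instead of filtered, time-modulated white noise. The names `hnm_frame` and `bin_amplitude` are hypothetical.

```python
import cmath
import math
import random


def hnm_frame(f0_hz, fm_hz, n_samples, sample_rate=16000,
              noise_amp=0.01, seed=0):
    """Return (harmonic, noise) parts of one frame split at Fm."""
    rng = random.Random(seed)
    # Harmonic part: multiples of F0 up to the maximum voiced frequency.
    n_harmonics = int(fm_hz // f0_hz)
    harmonic = [
        sum(math.sin(2 * math.pi * k * f0_hz * i / sample_rate) / k
            for k in range(1, n_harmonics + 1))
        for i in range(n_samples)
    ]
    # Noise part: random-phase components on the DFT grid strictly
    # above Fm and below the Nyquist frequency (a stand-in for noise).
    spacing = sample_rate / n_samples
    first_bin = int(fm_hz // spacing) + 1
    last_bin = n_samples // 2 - 1
    phases = {b: rng.uniform(0.0, 2 * math.pi)
              for b in range(first_bin, last_bin + 1)}
    noise = [
        noise_amp * sum(math.sin(2 * math.pi * b * i / n_samples + ph)
                        for b, ph in phases.items())
        for i in range(n_samples)
    ]
    return harmonic, noise


def bin_amplitude(signal, freq_hz, sample_rate=16000):
    """Amplitude of one frequency component via a single-bin DFT."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate)
              for i, x in enumerate(signal))
    return 2 * abs(acc) / n


# One 20 ms frame: F0 = 200 Hz, Fm = 4000 Hz.
harmonic, noise = hnm_frame(f0_hz=200.0, fm_hz=4000.0, n_samples=320)
speech = [h + n for h, n in zip(harmonic, noise)]
```

Checking single frequencies with `bin_amplitude` shows the split: the harmonic part carries the energy at 200 Hz and none at 6000 Hz, while the noise part does the opposite. Because Fm is a per-frame parameter, it can vary over time, for example dropping to zero in unvoiced frames.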
At present, there is no Slovak speech synthesizer that fulfills the demanding requirements of natural-sounding speech synthesis. The best results are achieved by TTS synthesizers based on concatenation of acoustic units. However, their quality is influenced most of all by the quality of the corpus database and by the quality of the acoustic unit concatenation. At the Department of Telecommunications, a corpus synthesizer is being developed that provides good conditions for high-quality speech synthesis, given a high-quality corpus database and good post-processing in the form of prosody and emotion modeling of the synthesized speech. Modeling of speaker emotions is very closely related to modeling of the prosodic features of speech, and both are therefore very language specific. No system that models the prosody of Slovak speech with emotions exists so far.
The goal of our research and development effort is to create a basis for high-quality synthesis of Slovak speech based on concatenation of acoustic units, which depends to a large extent on the quality of the corpus databases, and to create a post-processing block for modifying prosody and speaker emotions. This will make it possible to radically increase the intelligibility and naturalness of the synthetic speech. Our approach is based on HNM modeling of speech, so this block can be used not only for speech synthesis but also for modification of pre-recorded speech. We plan to build a specialized laboratory at the Department of Telecommunications for professional-quality speech recording, where domain-oriented corpora for the corpus synthesizer will be recorded.
The first main goal is the research and subsequent development of systems, based on HNM (harmonic plus noise) modeling of speech, that will be able to modify the prosodic features of speech and the emotions of the speaker. Development of these systems involves the analysis, selection and implementation of an acoustic-phonological model. We also intend to analyse the possibilities of speaker voice conversion. Next, we plan to build a system for creating corpus databases for Slovak speech synthesis that will consist of sinusoidal parameters instead of classical acoustic units in PCM format. The sinusoidal parameters should allow fast and easy concatenation of the units and modification of the speech signal for the purposes of modeling prosody and speaker emotions.
The next goal is to build and operate a laboratory for professional-quality recording of the speech databases needed for high-quality TTS synthesis. This laboratory will also be used for teaching at the Department of Telecommunications in the subject Digital speech processing and other related subjects.
The project also plans close cooperation with other projects running at the Department of Telecommunications, specifically VEGA 1/3110/06 (Non-linear processing of multimedia and biomedicine signals in telecommunications), VEGA 1/3094/06 (Algorithms and methods of digital signal processing, processes of control of converged networks platform and NGN and new generation of multimedia services and applications) and VTP 1003/2003 (Virtual reality of multimedia speech synthesis).
Fig. 1. Schedule of the project
Time axis: project years 2007-2008, 2008-2009 and 2009-2010, by month (3 to 12, 1 to 12, 1 to 12, 1 to 2).
Tasks shown in the schedule:
Building of the laboratory
Performance of the laboratory
Recording of the classical domain oriented corpuses
Development of the system for sinusoidal corpuses creation
Creation of sinusoidal corpuses
Development of the system for prosodic modification of the speech
Analysis and implementation of acoustic-phonological model of intonation
Development of the system for emotions creation
Analysis of the possibilities of voice conversion
References:
[1] Dutoit, T.: An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, 1997.
[2] Furui, S. et al.: "Speech-to-Text and Text-to-Speech Summarization of Spontaneous Speech", IEEE Trans. on Speech and Audio Processing, vol. 12, no. 4, pp. 401-408, January 2004.
[3] Stylianou, Y.: "Applying the Harmonic plus Noise Model in Concatenative Speech Synthesis", IEEE Trans. on Speech and Audio Processing, vol. 9, no. 1, pp. 21-29, January 2001.
[4] Stylianou, Y.: "Applying the Harmonic plus Noise Model in Concatenative Speech Synthesis", IEEE Trans. Acoustics, Speech and Signal Processing, vol. 9, no. 1, pp. 21-29, January 2001.
[5] Stylianou, Y., Laroche, J., Moulines, E.: "High-Quality Speech Modification based on Harmonic + Noise Model", Proc. EUROSPEECH, pp. 451-454, 1995.