Speech Processing

Text to speech

keywords

TTS, speech synthesis, automatic labeling, database design, concatenation, unit selection, intonation modeling, prosody modification

Abstract:

TTS (text-to-speech) synthesis is a process that takes raw text as input and converts it into a speech audio signal. The goal is to create speech that is intelligible and preserves the original meaning of the text. At the same time, the speech must sound natural: a listener should not be able to tell whether a computer or a human is speaking. The process can be divided into two phases. The first, high-level synthesis (or natural language processing, NLP), generates a target specification that serves as input to the second phase, low-level synthesis, which creates the synthetic signal by digital signal processing (DSP). The output of the first phase is a phonetic transcription together with the acoustic parameters of the identified speech units. The target specification is given mainly by the fundamental frequency curve, whose shape results from the intonation modeling process. Synthesis in the DSP phase can be achieved in several ways; the state of the art is concatenative synthesis, which uses prerecorded speech units that are re-sequenced and joined together. The database may contain just one realization of each speech unit or many. A so-called diphone synthesizer uses the first type of database, in which monotone diphones serve as the basic speech units. This type of synthesizer produces speech of high and constant intelligibility, but it sounds rather unnatural. Corpus-based synthesis, on the other hand, uses many speech units of the same phonetic type and sounds natural, but constant high intelligibility is not always guaranteed. Diphone synthesis achieves the target specification solely by signal modification; corpus-based synthesis reduces the amount of modification through proper unit selection. The quality of the modification depends on the method used; the most common methods are overlap-and-add techniques and sinusoidal modeling.
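The way unit selection trades off closeness to the target specification against smooth joins can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate unit is described only by its fundamental frequency at its start and end, that the target cost is the distance of the unit's mean F0 from the target F0, and that the join cost is the F0 jump at the concatenation point. The names `select_units`, `target_cost`, `join_cost`, and `join_weight` are hypothetical and not taken from any particular synthesizer.

```python
from typing import List, Tuple

# Hypothetical unit description: (unit id, F0 at start in Hz, F0 at end in Hz)
Unit = Tuple[str, float, float]


def target_cost(unit: Unit, target_f0: float) -> float:
    """Distance between the unit's mean F0 and the target F0 (assumed cost)."""
    _, f0_start, f0_end = unit
    return abs((f0_start + f0_end) / 2.0 - target_f0)


def join_cost(left: Unit, right: Unit) -> float:
    """F0 discontinuity at the join between two consecutive units (assumed cost)."""
    return abs(left[2] - right[1])


def select_units(target_f0s: List[float],
                 candidates: List[List[Unit]],
                 join_weight: float = 1.0) -> Tuple[List[str], float]:
    """Pick one candidate per position, minimizing total target + join cost.

    Uses dynamic programming (a Viterbi-style search) over the candidate lattice.
    """
    # best[i][k] = (cheapest cumulative cost ending in candidate k, backpointer)
    best = [[(target_cost(u, target_f0s[0]), -1) for u in candidates[0]]]
    for i in range(1, len(target_f0s)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_k = min(
                (best[i - 1][k][0] + join_weight * join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((target_cost(u, target_f0s[i]) + prev_cost, prev_k))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][k][0]
    chosen = []
    for i in range(len(best) - 1, -1, -1):
        chosen.append(candidates[i][k][0])
        k = best[i][k][1]
    chosen.reverse()
    return chosen, total
```

In a real corpus-based synthesizer the costs would combine many more features (spectral distance, duration, phonetic context); the sketch only shows how a smooth join can outweigh a slightly worse match to the target.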
Diphone synthesis has reached its limits, whereas the development of corpus-based synthesis is ongoing. A suitable database for a corpus-based synthesizer needs to be phonetically rich or balanced, which can be achieved by proper database design. Such a database usually contains more than one hour of speech. The quality of synthesis can be raised by targeting the content to a limited domain. To use the database in the synthesizer, the boundaries of the units (phonemes) must be labeled. This is done by a segmentation process that can be either manual or automatic. Manual segmentation is a very time-consuming and error-prone task. Automatic methods, using DTW (dynamic time warping) or acoustic modeling with HMMs (hidden Markov models), are less time-demanding, and the errors they cause are systematic and can be eliminated by further processing. All of the problems mentioned above have been successfully solved at our department.
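As an illustration of DTW-based automatic labeling, the following minimal sketch aligns a reference utterance whose phoneme boundaries are already known with a new recording, and transfers the boundaries along the alignment path. It assumes, for simplicity, one scalar feature per frame and an absolute-difference local distance; practical systems would typically use spectral feature vectors. The function names `dtw_align` and `transfer_boundaries` are hypothetical.

```python
import math
from typing import List, Tuple


def dtw_align(ref: List[float], test: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two frame sequences with dynamic time warping.

    Returns the total alignment cost and the warping path as
    (ref_frame, test_frame) index pairs in increasing order.
    """
    n, m = len(ref), len(test)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - test[j - 1])  # local frame distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def transfer_boundaries(path: List[Tuple[int, int]],
                        ref_boundaries: List[int]) -> List[int]:
    """Map each labeled reference frame to the first test frame aligned with it."""
    first_match = {}
    for r, t in path:
        first_match.setdefault(r, t)
    return [first_match[b] for b in ref_boundaries]
```

For example, with a labeled reference `[0, 0, 1, 1, 2, 2]` whose units start at frames 2 and 4, `dtw_align` followed by `transfer_boundaries` places the corresponding boundaries in a differently timed recording of the same content.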