Selection of synthesis elements
Statistical research on phonetic elements in the Slovak language
The first step in selecting appropriate candidates for synthesis elements is a thorough analysis of the phonetic elements of the given language. Philological study of the quantitative characteristics of the Slovak language has a considerable tradition in Slovakia. The main focus of this research is frequency analysis (how often elements occur in a text), but the sequential ordering of elements, the likelihood of their co-occurrence, entropy, and redundancy are also studied. Published statistical results mostly describe written language; studies of spoken language are sparse.
The elements of the Slovak language can be divided into orthographic and phonetic elements as follows [SAV]:
Orthographic elements:
- grapheme: the fundamental unit of written language (e.g. the word “včela” consists of the graphemes v, č, e, l, a).
- digram: a pair of adjoining graphemes (e.g. the word “včela” consists of the digrams vč, če, el, la).
- trigram: a triad of adjoining graphemes (e.g. the word “včela” consists of the trigrams vče, čel, ela).
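The digram and trigram definitions above amount to sliding a window of two or three graphemes across the word. The following is a minimal Python sketch of that idea, assuming that each precomposed Unicode character is one grapheme (a simplification that ignores Slovak digraphs such as ch, dz and dž). The function name `grapheme_ngrams` is illustrative and does not come from [SAV].

```python
import unicodedata


def grapheme_ngrams(word: str, n: int) -> list[str]:
    """Return every run of n adjoining graphemes in word, left to right."""
    # NFC keeps letters such as "č" as one code point instead of "c" + caron.
    chars = unicodedata.normalize("NFC", word)
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


for n, label in ((1, "graphemes"), (2, "digrams"), (3, "trigrams")):
    print(label, grapheme_ngrams("včela", n))
```

For “včela” this reproduces the grapheme, digram, and trigram lists given in the definitions.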
Phonetic elements:
- speech sound (phoneme): the smallest articulatory-acoustic unit of speech (e.g. the word “včela” consists of the speech sounds f, č, e, l, a).
- double-combination of speech sounds: a pair of adjoining speech sounds (e.g. the word “včela” consists of the double-combinations fč, če, el, la).
- allophone: a phoneme in one of its positional variants, determined by the left and right context in which the phoneme occurs.
- diphone: an element used in speech signal processing. It represents the section of the speech signal from the middle of one phoneme to the middle of the next. The name of a diphone is formed by joining the symbols of the two phonemes it contains. Diphones are used for speech segmentation because a large part of the acoustic information that matters for phoneme identification lies in the transitions between phonemes. The advantage of diphones is that they store the coarticulation of the transition in their middle, while their stable boundary regions are well suited to joining with other elements. Sections of silence are also treated as parts of diphones. They are used to segment the beginning and end of a word, where the speech signal passes from silence into the first phoneme or from the last phoneme into silence. This is why words in diphone notation start and end with a section of silence (e.g. the word “včela” consists of the diphones /f, fč, če, el, la, a/).
- triple-combination of speech sounds: a triad of adjoining phonemes (e.g. the word “včela” consists of the triple-combinations fče, čel, ela).
- triphone: an element used in speech signal processing. It represents the segment of the speech signal from the middle of one phoneme, across the whole following phoneme, to the middle of the phoneme after that. The name of a triphone is formed by joining the symbols of the three phonemes it contains. As with diphones, sections of silence are also included. Segmenting a word into triphones works the same way as segmenting it into diphones: it begins and ends with a section of silence (e.g. the word “včela” consists of the triphones /fč, fče, čel, ela, la/).
- syllable: a phonetic unit consisting of a vowel nucleus plus optional initial or final consonants or consonant clusters. A syllable can also consist of a single vowel. It can contain consonant-vowel as well as vowel-consonant transitions, so most coarticulation and other phonological and phonetic features fall within its bounds. Syllable length is not uniform. Phonological theory so far has no rules that strictly determine syllable boundaries, so various definitions are encountered in practice.
- demisyllable: half of a syllable. The syllable is split in its vowel nucleus (consonant clusters are kept inside the syllable), which reduces coarticulation problems during concatenation. The resulting elements have the form KV or VK, where K stands for zero, one, or more consonants and V for half of a vowel. In this way the word “Vianoce” (Christmas) can be transcribed as [via ia no o ce e]. There are around 1000 demisyllables in English. They have been shown to be suitable for the synthesis of German, where consonant clusters play an especially important role.
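The naming of double-combinations, diphones, and triphones described in the list above can be sketched in a few lines of Python. This is only an illustration of how the unit names are derived from a phoneme sequence; it does not cut any speech signal. The function `phoneme_ngrams` and its `with_silence` parameter are hypothetical names introduced here, and the silence symbol "/" follows the notation of the examples.

```python
SILENCE = "/"  # silence symbol, as in the diphone and triphone examples


def phoneme_ngrams(phonemes, n, with_silence=False):
    """Join every run of n adjoining units into one symbol string.

    with_silence=False gives double-/triple-combinations of speech sounds;
    with_silence=True pads the word with silence on both sides, which
    gives diphone (n=2) or triphone (n=3) names.
    """
    units = list(phonemes)
    if with_silence:
        units = [SILENCE] + units + [SILENCE]
    return ["".join(units[i:i + n]) for i in range(len(units) - n + 1)]


vcela = ["f", "č", "e", "l", "a"]  # speech sounds of "včela" as used in the examples
print(phoneme_ngrams(vcela, 2))                     # double-combinations
print(phoneme_ngrams(vcela, 3))                     # triple-combinations
print(phoneme_ngrams(vcela, 2, with_silence=True))  # diphone names
print(phoneme_ngrams(vcela, 3, with_silence=True))  # triphone names
```

Padding with silence is the only difference between the phonetic combinations and the signal-processing units in this sketch, which mirrors the distinction drawn in the definitions.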
Of the frequency analysis results reported in [SAV], the most important for designing the database of a diphone speech synthesizer is the frequency analysis of double-combinations of phonemes and of diphones (Tab. 1). Diphones and double-combinations of phonemes are very similar elements. In terms of frequency of occurrence, double-combinations of phonemes form a subset of diphones, because diphones can also contain a segment of silence. The double-combination of phonemes is an element used in phonetic research, whereas the diphone is used in segmentation and speech signal processing.
Double-combination of phonemes | Occurrence [%] | Double-combination of phonemes | Occurrence [%] | Double-combination of phonemes | Occurrence [%] |
---|---|---|---|---|---|
p r | 1,37 | o s | 0,82 | e n | 0,64 |
o v | 1,31 | n o | 0,80 | s k | 0,63 |
p o | 1,25 | ľ i | 0,79 | m e | 0,63 |
n a | 1,21 | l a | 0,77 | t a | 0,62 |
ň e | 1,16 | d o | 0,76 | ľ e | 0,62 |
k o | 1,10 | v e | 0,75 | r i | 0,62 |
s t | 1,10 | v o | 0,73 | j e | 0,60 |
v a | 1,06 | o r | 0,73 | o b | 0,56 |
r e | 1,05 | h o | 0,72 | e i<> | 0,55 |
r o | 1,04 | v i | 0,70 | ť i | 0,53 |
t o | 1,02 | l o | 0,70 | ď e | 0,52 |
r a | 0,96 | ť e | 0,68 | m i | 0,52 |
o u<> | 0,88 | s a | 0,66 | a k | 0,51 |
o m | 0,86 | s ť | 0,64 | k a | 0,51 |
Tab. 1 The most frequent double-combinations of phonemes in the Slovak language.
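An occurrence table of this kind can be computed by counting adjoining pairs in phonetically transcribed words. The sketch below assumes that each percentage is the share of one double-combination among all double-combinations counted, and that pairs are not counted across word boundaries. Both are assumptions made here, not statements about the method in [SAV]. The toy corpus is invented for illustration.

```python
from collections import Counter


def pair_occurrence(words):
    """Map each double-combination to its share (in %) of all combinations."""
    counts = Counter()
    for phonemes in words:
        # zip pairs every speech sound with its right-hand neighbour
        counts.update(zip(phonemes, phonemes[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: 100.0 * c / total for pair, c in counts.items()}


# Toy input invented for illustration; not data from [SAV].
corpus = [["p", "r", "a"], ["p", "r", "o"], ["p", "o"]]
ranked = sorted(pair_occurrence(corpus).items(), key=lambda kv: -kv[1])
for pair, share in ranked:
    print(" ".join(pair), f"{share:.2f}")
```

Adding a silence symbol at both ends of every word before counting would turn the same routine into a diphone frequency count.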
As a rule, synthesizers are designed to generate words and sentences from an unlimited vocabulary. The building units from which speech is constructed by concatenation are most often allophones, phonemes, diphones, demisyllables, or syllables. Each of these basic units has both advantages and disadvantages.
Using the phoneme as a building unit for speech synthesis is advantageous in several respects. First, the inventory of distinct phonemes is relatively small, so the effort needed to extract them and the memory needed to store them are small. In addition, (productive) rules can be created for each language, with whose help ordinary words and sentences of that language can be automatically converted into a sequence of phonemes. This process is called phonetic transcription. The disadvantage of the phoneme as a building unit is that it is rather a logical representative of a whole group of speech sounds, its allophones. Allophones carry a certain degree of coarticulation, so higher-quality synthesis is better served by using them as building units. Rules for so-called allophonic transcription, or rules that supplement the phonetic transcription, can also be drawn up for allophones. The more precise the allophonic transcription, i.e. the more allophonic units are used, the better the synthesized speech. On the other hand, a growing number of allophones requires more storage and more rules. A reasonable compromise is to use 100 to 200 allophonic variants. The selected phonemes or allophones must be extracted very carefully from a speech signal and encoded into a sequence of parameters suitable for controlling a formant or LPC synthesizer. To improve the quality of the synthesized speech, it is also advisable to apply some interpolation when joining allophones, so that abrupt changes in formants or reflection coefficients are smoothed.
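The interpolation mentioned at the end of the paragraph above could take many forms. The sketch below shows one simple possibility: inserting a few linearly interpolated parameter frames at each join. It assumes that every unit is a non-empty list of frames and that every frame is a list of numeric synthesizer parameters (formant frequencies or reflection coefficients). The function names and the two-formant example values are hypothetical, and the source does not say which interpolation scheme is used.

```python
def bridge_frames(last, first, steps):
    """Return `steps` frames linearly interpolated between two parameter frames."""
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from `last` to `first`
        frames.append([a + t * (b - a) for a, b in zip(last, first)])
    return frames


def concatenate(units, steps=3):
    """Join units (non-empty lists of frames), bridging each boundary."""
    out = [list(frame) for frame in units[0]]
    for unit in units[1:]:
        # Smooth the jump from the previous unit's last frame to this unit's first.
        out.extend(bridge_frames(out[-1], unit[0], steps))
        out.extend(list(frame) for frame in unit)
    return out


# Hypothetical two-formant frames in Hz, invented for illustration.
unit_a = [[500.0, 1500.0], [500.0, 1500.0]]
unit_b = [[700.0, 1100.0], [700.0, 1100.0]]
for frame in concatenate([unit_a, unit_b]):
    print(frame)
```

Inserting frames lengthens the utterance slightly. An alternative under the same assumptions would be to overwrite the last and first frames of the neighbouring units instead, which keeps the duration unchanged.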
Using diphones and demisyllables as building units requires that we first produce the phonetic transcription of the message to be synthesized and then select and join the matching units from a dictionary of diphones or demisyllables. The advantage of diphones, as already mentioned, is that they carry important coarticulation information. Because diphones are joined within the same phoneme, the requirements for interpolating the transition are considerably reduced. Demisyllables, in contrast to diphones, usually contain the whole initial or final consonant cluster, which is harder to realize with diphones. Demisyllables can also be used effectively to influence the rhythm of speech; it has been shown that the length of the phonemes within a consonant cluster has an especially significant influence on rhythm. However, the number of diphones and demisyllables is estimated at about 2000, which causes difficulties mainly with their extraction from real speech, but also with their storage and handling.
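The two steps described above, transcription first and then dictionary lookup and joining, can be sketched as follows. The sketch assumes that the phonetic transcription is already available as a list of phoneme symbols and that the dictionary maps diphone names to lists of samples. The names `diphone_names`, `synthesize` and `inventory`, as well as the dummy sample values, are placeholders introduced here; a real system would store recorded signal segments or parameter frames.

```python
SILENCE = "/"  # silence symbol at word boundaries


def diphone_names(phonemes):
    """Name the diphones of a word, including the silence transitions."""
    padded = [SILENCE, *phonemes, SILENCE]
    return [a + b for a, b in zip(padded, padded[1:])]


def synthesize(phonemes, inventory):
    """Concatenate the stored units for a phonetically transcribed word."""
    names = diphone_names(phonemes)
    missing = [name for name in names if name not in inventory]
    if missing:
        raise KeyError(f"diphones not in inventory: {missing}")
    signal = []
    for name in names:
        signal.extend(inventory[name])
    return signal


# Placeholder inventory: each "recording" is a short list of dummy samples.
inventory = {name: [i] * 2 for i, name in
             enumerate(["/f", "fč", "če", "el", "la", "a/"])}
print(synthesize(["f", "č", "e", "l", "a"], inventory))
```

Checking for missing units before joining makes the cost of an incomplete inventory visible, which is the practical side of the estimate of about 2000 units.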
The syllable is not used much as a basic building unit of speech synthesis. The reason is that there are too many syllables; their number is many times greater than the number of diphones and demisyllables. Although coarticulation effects are naturally contained within syllables, they are missing at the points where syllables are joined and must be “added” by interpolation.
When we create specific messages that are to be stored in memory and synthesized on demand, we need to assemble strings of the corresponding phonetic symbols and add prosodic marks (for stress, pauses, …). For each sentence of a message we then create a separate contour of the fundamental tone. If, at the analysis stage, a corresponding string of acoustic parameters for controlling a formant or LPC synthesizer has been prepared for each phonetic symbol and prosodic mark, then the resulting string of symbols, supplemented with the fundamental-tone contour or the intensity contour, carries sufficient information to synthesize the message. When the designer prepares such messages in advance, the strings of symbols can of course be adjusted until the final message sounds as natural as possible. The situation is more difficult when speech must be synthesized in real time from previously unknown written text.
References:
[SAV] ROZINAJ G. et al.: The mission of research and development: Intelligent speech communication interface for public program Information community building – D 1.3 – Module of speech synthesis (Actual state analysis and the solution proposition), Košice, June 2004, pp. 76–89