Research Orientation

Speech recognition and analysis


Human-Computer Interaction, Speech recognition, Speech synthesis, Data mining, Microphone array, Mobile, Java


Over more than seven years, the department has built a solid theoretical and practical background in DTW, HMM and NN techniques, as well as in the HTK and SPHINX systems. We have progressed from simple tasks, such as recognition based on word models with a dictionary of several dozen words, to the currently supported several thousand words modeled by tied, context-dependent, speaker-independent phoneme models. In our experiments and practical realizations we have used the MASPER or REFREC 0.96 training procedures (built on HTK), producing various kinds of models of both speech units and non-speech events. As speech databases we use the Slovak SPEECHDAT and MOBILDAT databases, for training and evaluation either separately or jointly, the latter producing hybrid models (fixed-line and mobile). We modified the standard training procedures, which improved the overall results. For the SPHINX system, context-dependent as well as context-independent phoneme models were derived with the SphinxTrain procedure, modified for the Slovak language and the MOBILDAT database. In both systems we achieved accuracy comparable to that reported by researchers at well-known universities. For practical applications we use the ATK software package and SPHINX versions 3.5 and 4. The main achievement of this multi-year effort was the successful construction of a recognition system capable of recognizing about 1300 words in real time. This application was subsequently incorporated into a more complex system that may serve as an information kiosk; currently two services are fully operational: train departure information and weather forecast. We have also experimented with cross-language recognition, namely with Italian.
Besides speech recognition, we have successfully addressed other key analysis problems, in particular speech detection (we produced several VAD algorithms that outperform several well-known or experimental ones) and speaker identification. In the near future we intend to focus on continuous speech recognition, a problem that lies mostly in the application part of recognition. This would encompass statistical language modeling, improvements in computational efficiency, phoneme-based two-layer recognition, etc. Next, we would like to increase the robustness and accuracy of the HMM models, mainly through modifications to HMM training, the incorporation of other models, and modifications to the speech feature extraction process. The hybrid HMM-NN approach is also very appealing.
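
As a point of reference for the speech detection (VAD) task mentioned above, the following is a minimal sketch of a generic frame-energy detector. It is not one of the department's algorithms, which are not described here; it is only a common textbook baseline. The function names, the 8 kHz sampling rate, the 160-sample frame length, the 10 dB margin, and the assumption that the first few frames contain only background noise are all illustrative choices, not taken from the text.

```python
import math


def frame_energies_db(samples, frame_len):
    """Split samples into non-overlapping frames; return log energy (dB) per frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on digitally silent frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, margin_db=10.0, noise_frames=5):
    """Label each frame True (speech) or False (non-speech).

    Assumption: the first `noise_frames` frames contain background noise only,
    so their mean energy can serve as the noise floor estimate.
    """
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    n = min(noise_frames, len(energies))
    noise_floor = sum(energies[:n]) / n
    return [e > noise_floor + margin_db for e in energies]


def tone(amplitude, n_samples, freq_hz=200.0, rate_hz=8000.0):
    """Hypothetical helper: a sine tone standing in for a real recording."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(n_samples)]


# Synthetic input: quiet segment, loud segment, quiet segment (5 frames each).
signal = tone(0.001, 800) + tone(0.5, 800) + tone(0.001, 800)
labels = energy_vad(signal)
```

A fixed energy margin of this kind degrades quickly in non-stationary noise, which is one reason practical detectors rely on richer features and adaptive thresholds.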