Extracting time-frequency features for visual identification of Persian vowels

Number of pages: 102
File format: Word
File code: 32129
Year: 2013
Degree: Master's
Category: Electronic Engineering
  • Summary of Extracting time-frequency features for visual identification of Persian vowels

    Master's Thesis in Electrical Engineering, Electronics Specialization

    Abstract

    In this thesis, a method for identifying Persian vowels in monosyllabic words is presented. After separating the video frames, selecting the frames corresponding to the pronunciation of the vowel in each monosyllabic word, and extracting the region around the lips, various features such as discrete cosine transform (DCT) coefficients, wavelet coefficients, and MFCC coefficients were extracted to recognize the vowels. The features were then reduced with the LSDA dimensionality reduction method, shrinking the feature vector to 25 elements, and the most effective features for recognition were determined. The database used in this research contains 580 videos of monosyllabic words spoken by different speakers; 381 videos were used for training and 199 for testing. The extracted features were applied as input to a two-layer neural network with 20 neurons in the hidden layer and one neuron in the output layer, using a tangent sigmoid activation function in the hidden layer and a linear function at the output; the network was trained with gradient descent with a variable learning rate. The best recognition rate, 95.75%, was obtained by computing MFCC coefficients from the first quarter of the DCT coefficient vector produced by zigzag scanning of the cosine coefficient matrix.
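    As a rough, illustrative sketch of the classifier described above (written in Python with NumPy, which here stands in for whatever environment the thesis actually used), the snippet below wires up a 25-input network with 20 tangent sigmoid hidden neurons, one linear output neuron, and plain gradient descent with a variable learning rate. The synthetic data, the mean squared error loss, and the exact rate adaptation rule are assumptions for illustration, not the thesis's implementation.

```python
# Minimal sketch of the 25 -> 20 -> 1 network from the abstract.
# Hidden layer: tanh (tangent sigmoid); output: linear.
# Loss, data, and learning-rate rule are placeholders (assumptions).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((381, 25))                    # 381 training samples, 25 features
y = rng.integers(0, 2, size=(381, 1)).astype(float)   # placeholder targets

W1 = rng.standard_normal((25, 20)) * 0.1; b1 = np.zeros((1, 20))
W2 = rng.standard_normal((20, 1)) * 0.1;  b2 = np.zeros((1, 1))
lr, prev_loss = 0.01, np.inf

for epoch in range(200):
    # Forward pass: tanh hidden layer, linear output.
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    loss = np.mean((out - y) ** 2)

    # Variable learning rate: grow on improvement, shrink otherwise
    # (a common heuristic; the thesis's exact rule is not given here).
    lr = lr * 1.05 if loss < prev_loss else lr * 0.7
    prev_loss = loss

    # Backward pass for mean squared error.
    d_out = 2 * (out - y) / len(X)
    dW2 = h.T @ d_out; db2 = d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * (1 - h ** 2)               # tanh derivative
    dW1 = X.T @ d_h;  db1 = d_h.sum(axis=0, keepdims=True)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

    With real lip-region feature vectors in place of the random data, this is the shape of the training loop the abstract describes.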

    Keywords:

    lip reading, vowel recognition, time-frequency features, feature dimensionality reduction, neural networks

    1-1 Introduction

    Humans have long known that speech can be better understood by paying attention to the movements of the speaker's lips and mouth while words are being pronounced. Most of us probably use this non-acoustic aspect of speech unconsciously to some extent, and when the listening environment is noisy, we pay more attention to the speaker's lip movements. This is even more important for people with hearing impairments. In addition, lip movements, or visual speech signals, can significantly improve the accuracy of audio speech recognition systems, especially in noisy environments. Synchronizing lip movements with the speech sound, eliminating the delay between sound and image, and automatic video dubbing are other applications of this field.

    There are many people whose auditory system is damaged and who are therefore unable to communicate properly with others. Human speech occurs in nature in both audio and visual form: audio speech refers to the waveform produced by the speaker, and visual speech refers to the movements of the lips, tongue, and facial muscles. In audio speech, the basic unit is called the phoneme [1]. In the visual domain, the basic unit of mouth movements is called the viseme [2], the smallest visual component of speech. Many speech sounds are visually ambiguous; such sounds are grouped into a single class represented by one viseme. The mapping between phonemes and visemes is therefore many-to-one: sets of phonemes that have a similar effect on the shape of the mouth map to the same viseme. The following tables give the grouping of visemes in English and Persian.

    (The tables are available in the main file)
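    To make the many-to-one phoneme-to-viseme idea concrete, here is a small hypothetical Python mapping. The groups below (e.g., the bilabials /p/, /b/, /m/ collapsing into one viseme) are a common illustrative example, not the thesis's actual English or Persian tables, which appear in the main file.

```python
# Hypothetical many-to-one phoneme-to-viseme grouping (illustrative only).
# Phonemes that look alike on the lips collapse into a single viseme class.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
}

def viseme_of(phoneme: str) -> str:
    """Map a phoneme to its viseme class (many-to-one)."""
    return PHONEME_TO_VISEME.get(phoneme, "V_unknown")

# /p/, /b/, and /m/ are visually ambiguous: all three map to one viseme.
assert viseme_of("p") == viseme_of("b") == viseme_of("m")
```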

    Generally, there are three approaches to speech recognition: audio speech recognition, visual speech recognition, and audio-visual speech recognition. This research is concerned with visual speech recognition.

    1-2 Structure of the thesis

    This thesis examines visual speech recognition methods across its chapters. The first chapter gives an introduction to speech recognition. The second chapter reviews the research conducted in the field of visual speech recognition and the different methods used for it. The third chapter introduces different methods for separating the mouth from the rest of the face; using these methods, in addition to reducing the size of the images, we avoid excessive complexity and dimensionality in the features. The fourth chapter describes the method of calculating and extracting time-frequency features of the mouth region from different video frames, and examines their performance as the number of selected frames and the size of the images are varied, together with one of the feature reduction methods; the extracted features are applied to a neural network for recognition, and the database used in this research is also introduced. In this thesis, time-frequency features are extracted from images of the speaker's mouth and used as input parameters to a neural network for recognition. Because we worked with video, we had to handle a varying number of frames per clip: the frames were first separated manually, then the region around the mouth was selected and the desired features were computed for that region in each frame. To improve performance and reduce the dimensionality of the features, we used the LSDA dimensionality reduction technique, which reduced the size of our feature vector. The database consists of different individuals, each of whom uttered the monosyllabic words two or three times. Finally, a vowel recognition rate of 95.75% was achieved.
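    The following sketch illustrates two steps of the pipeline described above: taking the 2-D DCT of a (placeholder) mouth-region image, reading the coefficients in zigzag order, and keeping the first quarter of the resulting vector. Since LSDA is not available in common Python libraries, PCA stands in at the end purely to show the reduction to 25 dimensions; the thesis itself uses LSDA, and the random images and clip count are placeholders.

```python
# Sketch of DCT + zigzag feature extraction, with PCA as a stand-in
# for LSDA (assumption: the thesis's LSDA step is not reproduced here).
import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def zigzag(mat):
    """Read a 2-D matrix in JPEG-style zigzag order (low frequencies first)."""
    h, w = mat.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([mat[i, j] for i, j in order])

rng = np.random.default_rng(0)
clips = rng.random((30, 64, 64))                # placeholder: 30 mouth-region images

features = []
for img in clips:
    # Separable 2-D DCT over rows and columns.
    coeffs = dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")
    vec = zigzag(coeffs)
    features.append(vec[: len(vec) // 4])       # keep first 1/4 of the zigzag vector

# Stand-in for LSDA: reduce each clip's feature vector to 25 dimensions.
reduced = PCA(n_components=25).fit_transform(np.array(features))
print(reduced.shape)                            # (30, 25)
```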

  • Contents & References of Extracting time-frequency features for visual identification of Persian vowels

    List:

    The first chapter: Introduction 1
    1-1 Introduction 2
    1-2 Thesis structure 4
    The second chapter: An overview of the conducted research 5
    2-1 Introduction 6
    2-2 Active contour models 6
    2-2-1 Energy function 7
    2-2-2 Energy minimization 9
    2-3 Active shape models 12
    2-4 Flexible models
    2-4-1 Lip model 21
    2-6-2 Color transformation
    2-8 Discrete cosine transform 26
    2-8-1 Modeling based on 3-D DCT 26
    2-8-1-1 Extraction of lip motion features 28
    2-8-2 Feature extraction from the target area 29
    2-8-2-1 Extraction of visual features 30
    2-8-3 Cosine transform and LSDA 31
    2-8-3-1 Preprocessing
    2-8-3-2 DCT method 31
    2-8-3-3 DCT + PCA 31
    2-8-3-4 DCT + LDA 32
    2-9 Bezier curve 35
    2-10 Separation of the lip area with the Cummins algorithm 37
    The third chapter: Mouth area extraction methods and detection systems 39
    3-1 Introduction 40
    3-2 Lip area detection 41
    3-2-1 Analysis of lip and skin color composition 41
    3-2-2 Hue, saturation, and light intensity (HSV) 42
    3-2-3 Removing the red component 43
    3-2-4 Cummins algorithm 43
    3-2-4-1 Algorithm implementation 44
    3-2-5 Brightness and binarization 45
    3-2-6 Combined methods 45
    3-3 Classification and identification methods 47
    3-3-1 Neural networks 47
    3-3-1-1 Feedforward networks 48
    3-3-1-2 Error backpropagation algorithm 48
    3-3-2 Hidden Markov model
    The fourth chapter: Feature extraction, implementation of the proposed method, and introduction of the database 51

    4-1 Database 52
    4-1-1 Separation of recorded videos 53
    4-2 Extracted features 54
    4-4-1 Framing 61
    4-4-2 Windowing 62
    4-4-3 Discrete Fourier transform 62
    4-4-4 Mel scale 62
    4-4-5 Discrete cosine transform 64
    4-4-5-1 Calculating cosine and wavelet coefficients 65
    4-4-5-2 Calculating Mel-frequency coefficients 65
    4-5 Finding the center of the lip and extracting an area around the lip 66
    4-5-1 Zigzag scan 67
    4-5-2-1 Using the log-sigmoid function and changing the training algorithm 70
    4-5-2-2 Using the tan-sigmoid function and the momentum algorithm
    4-6 Extracting features from different images
    4-6-1 Extracting features from new images
    4-6-2 Mel-frequency coefficients and cosine coefficients 72
    4-7 Reducing the number of frames and reducing the size of images 73
    4-7-1 Calculating MFCC coefficients 73
    4-7-3 Reducing the number of frames and reducing the size of images with the resize command

    References:

    [1] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, 2001.

    [2] Sadeghi, Vahida Al-Sadat, "Vowel Recognition in Persian Monosyllabic and Bisyllabic Words," Master's Thesis, Semnan University, 1385.

    [3] E. D. Petajan, "Automatic Lipreading to Enhance Speech Recognition," PhD thesis, University of Illinois at Urbana-Champaign, 1984.

    [4] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active Contour Models," International Journal of Computer Vision, pp. 321-331, 1988.

    [5] C. Bregler and Y. Konig, "Eigenlips for Robust Speech Recognition," in Proc. IEEE Conf. Acoustics, Speech and Signal Processing, pp. 669-672, 1994.

    [6] Takeshi Saitoh and Ryosuke Konishi, "Word Recognition Based on Two-Dimensional Lip Motion Trajectory," International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2006), pp. 287-290, 12-15 Dec. 2006.

    [7] Mir Hadi Seyedarabi, Ali Aghagolzadeh, and Sohrab Khanmohammadi, "Automatic tracking of lip movements and their feature points using active contours," 14th Iranian Conference on Electrical Engineering (ICEE), 2006.

    [8] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active Shape Models - Their Training and Application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan. 1995.

    [9] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, "Extraction of visual features for lipreading," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 198-213, Feb. 2002.

    S. H. Leung, et al., "A real-time automatic lipreading system," International Symposium on Circuits and Systems, no. 2, pp. 101-104, IEEE, Vancouver, Canada, May 2004.

    [12] D. Thambiratnam, T. Wark, S. Sridharan, and V. Chandran, "Speech Recognition in Adverse Environments Using Lip Information," IEEE TENCON 1997: Speech and Image Technologies for Computing and Telecommunications, vol. 1, pp. 149-152, 4 Dec. 1997.

    [13] Tanveer A. Faruquie, Abhik Majumdar, Nitendra Rajput, and L. V. Subramaniam, "Large Vocabulary Audio-Visual Speech Recognition Using Active Shape Models," 15th International Conference on Pattern Recognition, vol. 3, pp. 106-109, 2000.

    [14] A. W. C. Liew, et al., "Lip contour extraction from color images using a deformable model," Pattern Recognition, vol. 35, pp. 2949-2962, 2002.

    [15] Stefan Horbelt and Jean-Luc Dugelay, "Active Contours for Lipreading: Combining Snakes with Templates," 15th GRETSI Symposium on Signal and Image Processing, pp. 18-22, September 1995, France.

    [16] Mohammad Mehdi Hosseini, Abdorreza Alavi Gharahbagh, and Sedigheh Ghofrani, "Vowel Recognition by Using the Combination of Haar Wavelet and Neural Network," KES'10: Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, Part I, pp. 331-339, 2010.

    [17] M. M. Hosseini and S. Ghofrani, "Automatic Lip Extraction Based on Wavelet Transform," IEEE GCIS, pp. 393-396, 2009, China.

    [18] Dahai Yu, Ovidiu Ghita, Alistair Sutherland, and Paul F. Whelan, "A PCA-Based Manifold Representation for Visual Speech Recognition," CIICT 2007: Proceedings of the China-Ireland International Conference on Information and Communication Technologies, 28-29 August 2007, Dublin, Ireland.

    [19] Y. L. Tian and T. Kanade, "Robust Lip Tracking by Combining Shape, Color and Motion," Proc. of the Asian Conference on Computer Vision, pp. 1040-1045, 2000.
