Extraction of time-frequency feature for visual identification of Persian vowels

Number of pages: 107 File Format: word File Code: 30928
Year: 2013 University Degree: Master's degree Category: Electronic Engineering
  • Part of the Content
  • Contents & Resources
  • Summary of Extraction of time-frequency feature for visual identification of Persian vowels

Master's Thesis in Electrical Engineering, Electronics specialization

    Abstract

In this thesis, a method for identifying Persian vowels in monosyllabic words is presented. After separating the video frames, selecting the frames related to the pronunciation of the vowel in the monosyllabic word, and extracting the region around the lips, various features such as cosine (DCT) coefficients, wavelet coefficients, and MFCC coefficients were extracted to recognize the vowels. The features were then reduced with the LSDA dimension-reduction method to a feature vector of length 25, and the most effective features for recognition were determined. The database used in this research contains monosyllabic words spoken by different speakers and comprises 580 videos; 381 videos were used for training and 199 for testing. The extracted features were applied as input to a two-layer neural network with 20 neurons in the hidden layer and one neuron in the output layer. A tangent-sigmoid activation function was used in the hidden layer and a linear function at the output, and the network was trained with gradient descent with a variable learning rate. The best recognition rate, 95.75%, was obtained by computing MFCC coefficients from the first quarter (1/4) of the DCT coefficient vector produced by zigzag scanning of the cosine-coefficient matrix.
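As a point of reference, the classifier described above can be sketched as follows; this is a minimal illustration assuming the stated architecture (25 LSDA-reduced inputs, 20 tan-sigmoid hidden neurons, one linear output, gradient descent with a variable learning rate), and the data, labels, and hyperparameters below are placeholders rather than the author's original code.

```python
import numpy as np

# Sketch of the classifier described in the abstract (not the original code):
# 25 LSDA-reduced features -> 20 tan-sigmoid hidden neurons -> 1 linear output,
# trained with batch gradient descent and a variable (adaptive) learning rate.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 25, 20, 1

W1 = rng.normal(0, 0.1, (n_hidden, n_in)); b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(0, 0.1, (n_out, n_hidden)); b2 = np.zeros((n_out, 1))

def forward(X):
    """X: (n_in, n_samples). Returns hidden activations and linear outputs."""
    H = np.tanh(W1 @ X + b1)          # tan-sigmoid hidden layer
    Y = W2 @ H + b2                   # linear output layer
    return H, Y

def train(X, T, epochs=500, lr=0.01, lr_up=1.05, lr_down=0.7):
    """Gradient descent with a variable learning rate: the rate grows while the
    mean-squared error decreases and shrinks when it increases."""
    global W1, b1, W2, b2
    prev_err = np.inf
    for _ in range(epochs):
        H, Y = forward(X)
        E = Y - T                                  # output error
        err = float(np.mean(E ** 2))
        # Back-propagate through the linear output and the tanh hidden layer.
        dW2 = E @ H.T / X.shape[1]
        db2 = E.mean(axis=1, keepdims=True)
        dH = (W2.T @ E) * (1.0 - H ** 2)
        dW1 = dH @ X.T / X.shape[1]
        db1 = dH.mean(axis=1, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
        lr = lr * lr_up if err < prev_err else lr * lr_down
        prev_err = err

# Hypothetical usage with placeholder data (381 training vectors of 25 features,
# labels encoding the six Persian vowels; illustrative only).
X_train = rng.normal(size=(25, 381))
T_train = rng.integers(1, 7, size=(1, 381)).astype(float)
train(X_train, T_train)
```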

    Key words:

lip reading, vowel recognition, time-frequency features, feature dimension reduction, neural networks.

1-1 Introduction

The listener pays attention to the speaker's lip movements while the words are being spoken and pronounced. Probably all of us unconsciously use this non-acoustic aspect of speech to some extent, and when the listening environment is noisy we pay even more attention to the speaker's lip movements. This matters even more for people whose hearing is impaired. In addition, lip movements, or visual speech signals, can significantly improve the accuracy of acoustic speech recognition systems, especially in noisy environments. Synchronizing lip movements with the speech sound, removing the delay between sound and image, and automatic video dubbing are other applications of this field.

There are many people whose hearing system is damaged and who are unable to communicate with others through sound. Human speech naturally appears in two forms, audio and visual. Audio speech refers to the acoustic waveform produced by the speaker, and visual speech refers to the movements of the lips, the tongue, and the muscles of the face. In acoustic speech the basic unit is called the phoneme; in the visual domain the basic unit of mouth movements is called the viseme, the smallest visual component of speech. Many speech sounds are visually ambiguous; such sounds are grouped into one class that is represented by a single viseme. The mapping from phonemes to visemes is therefore many-to-one, that is, sets of phonemes can be found that have a similar effect on the shape of the mouth. The following tables give the grouping of visemes in English and in Persian [1], [2].

Table 1-1 Grouping of visemes in English (only the rows below could be recovered from the source)

Group   Phonemes     Group   Phonemes
10      th, dh       3       E
11      t, d         4       I
12      k, g         5       O
13      sh, zh       6       U
14      s, z         7       (not recovered)

Table 1-2 Grouping of visemes in the Persian language (entries as recovered from the source; repeated letters stand for distinct Persian characters with the same pronunciation, and "?" marks entries lost in extraction)

1. F, V                     5. R                              9. A
2. Th, Q, S, Z, Z, Z, Z     6. C, C, G, K, N, T, D, Y, T      10. ?
3. Zh, Sh                   7. E                              11. ?
4. B, P, M                  8. ?                              12. O
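The grouping shown in Tables 1-1 and 1-2 amounts to a many-to-one map from phonemes to viseme classes. A minimal Python illustration, using only the English groups recovered above (the dictionary and function names are hypothetical):

```python
# Many-to-one mapping from phonemes to viseme classes (illustrative only; the
# groups are the English entries recovered in Table 1-1, not a full inventory).
PHONEME_TO_VISEME = {
    "th": 10, "dh": 10,
    "t": 11, "d": 11,
    "k": 12, "g": 12,
    "sh": 13, "zh": 13,
    "s": 14, "z": 14,
}

def viseme_of(phoneme: str):
    """Return the viseme class of a phoneme, or None if it is not listed."""
    return PHONEME_TO_VISEME.get(phoneme)

# Visually ambiguous sounds fall into the same class:
assert viseme_of("t") == viseme_of("d") == 11
```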

In general, there are three approaches to speech recognition: acoustic speech recognition, visual speech recognition, and audio-visual speech recognition; the visual approach is the one investigated in this research. The first chapter gave an introduction to speech recognition. The second chapter reviews the research carried out in the field of visual speech recognition and the different methods used for it. The third chapter introduces different methods for separating the mouth region from the rest of the face, so that, in addition to reducing the size of the images, the complexity and large dimensionality of the features can be avoided. The fourth chapter describes how the time-frequency features of the selected mouth region are computed and extracted from the video frames, examines their performance when the number of selected frames and the size of the images are changed and when one of the feature-reduction methods is applied, explains how the extracted features are applied to the neural network for recognition, and introduces the database used in this research.
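As a rough sketch of the per-frame feature computation summarized above and in the abstract (2-D cosine transform of the mouth region, zigzag scanning of the coefficient matrix, and keeping the first quarter of the resulting vector), the following Python fragment shows one possible reading; the helper names `zigzag` and `frame_features`, the 32x32 crop size, and the choice of `scipy.fft.dctn` are assumptions for illustration, not the thesis implementation.

```python
import numpy as np
from scipy.fft import dctn   # 2-D type-II DCT (available in SciPy >= 1.4)

def zigzag(block: np.ndarray) -> np.ndarray:
    """Read a 2-D matrix in zigzag (alternating anti-diagonal) order into a 1-D vector."""
    h, w = block.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda ij: (ij[0] + ij[1],
                                   ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))
    return np.array([block[i, j] for i, j in order])

def frame_features(mouth_roi: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """DCT of the mouth region, zigzag scan, keep the leading fraction of coefficients."""
    coeffs = dctn(mouth_roi.astype(float), norm="ortho")
    vec = zigzag(coeffs)
    return vec[: int(len(vec) * keep_fraction)]

# Hypothetical use on one 32x32 grayscale mouth crop:
roi = np.random.rand(32, 32)
print(frame_features(roi).shape)   # -> (256,), i.e. the first quarter of 1024 coefficients
```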


    2-1 Introduction

Visual speech recognition, or in other words lip reading, consists of two parts: first extracting features from the lip images, and then classifying those features. Two kinds of methods can be used to extract image features: image-based and model-based. In image-based methods, the features are extracted directly by applying mathematical transforms such as the Fourier transform, the wavelet transform, the discrete cosine transform, principal component analysis, or linear discriminant analysis to the images. The problem with these methods is the large size and redundancy of the data and the sensitivity to rotation and displacement of the lips. In model-based methods, a model of the lip is created and described by a small set of parameters, as in active shape models, active contour models, and flexible models. The advantage of this approach is that the features are expressed in small dimensions and the model is unaffected by image brightness, rotation, size, and displacement of the lips. Petajan was probably the first researcher to develop a lip-reading system [3].

2-2 Active contour models

The active contour model is an open or closed curve with a number of control points placed near the image of the object whose shape we want to extract. Several energy terms are defined for its deformation, and by minimizing these energies the curve takes the required shape. This model was introduced by Kass and his colleagues [4], who named it the snake because the movement of the contour resembles a snake crawling. A snake can be described by a number of points together with an internal elastic energy and an energy based on the outer edges of the image.

2-2-1 Energy function

A snake can be represented by n points v_i = (x_i, y_i), i = 0, 1, 2, ..., n-1. The energy function of the snake is expressed as follows:

E^*_{snake} = \int_0^1 E_{snake}(v(s))\,ds = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \right] ds        (2-1)

E_{external} = E_{image} + E_{con}        (2-2)

E_{internal} = E_{cont} + E_{curv}        (2-3)

The external energy is the sum of the image energy and the energy of the external constraint applied by the user. The internal energy is the sum of the snake's continuity (elastic) energy and its bending (curvature) energy:

E_{int} = \frac{1}{2}\left( \alpha(s)\,\lVert v_s(s)\rVert^2 + \beta(s)\,\lVert v_{ss}(s)\rVert^2 \right)

Large values of \alpha(s) and \beta(s) increase the internal energy of the snake when it stretches or bends too much, while small values place fewer restrictions on the size and shape of the snake.
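To make relations (2-1) to (2-3) concrete, the sketch below evaluates a discretized internal energy and a gradient-based image energy for a polygonal snake; it is a minimal illustration of the standard snake formulation with assumed parameter values (alpha, beta), not the implementation of the works cited above.

```python
import numpy as np

def internal_energy(pts: np.ndarray, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Discrete internal energy of a closed snake (continuity + bending terms).
    pts is an (n, 2) array of contour points v_i = (x_i, y_i)."""
    d1 = np.roll(pts, -1, axis=0) - pts                                 # ~ v_s  (first difference)
    d2 = np.roll(pts, -1, axis=0) - 2 * pts + np.roll(pts, 1, axis=0)   # ~ v_ss (second difference)
    e_cont = alpha * np.sum(d1 ** 2)
    e_curv = beta * np.sum(d2 ** 2)
    return 0.5 * (e_cont + e_curv)

def image_energy(image: np.ndarray, pts: np.ndarray) -> float:
    """Image energy: negative gradient magnitude sampled at the contour points,
    so that minimizing the total energy pulls the snake toward strong edges."""
    gy, gx = np.gradient(image.astype(float))
    grad_mag = np.hypot(gx, gy)
    rows = np.clip(np.rint(pts[:, 1]).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.rint(pts[:, 0]).astype(int), 0, image.shape[1] - 1)
    return -float(grad_mag[rows, cols].sum())

def snake_energy(image: np.ndarray, pts: np.ndarray) -> float:
    """Total energy: internal + image energy (the user-constraint term E_con is omitted here)."""
    return internal_energy(pts) + image_energy(image, pts)

# Hypothetical usage: a circular snake of 30 points on a random 64x64 image.
img = np.random.rand(64, 64)
t = np.linspace(0, 2 * np.pi, 30, endpoint=False)
contour = np.stack([32 + 10 * np.cos(t), 32 + 10 * np.sin(t)], axis=1)
print(snake_energy(img, contour))
```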

  • Contents & References of Extraction of time-frequency feature for visual identification of Persian vowels

    List:

    Chapter One: Introduction 1

    1-1 Introduction ..2

    1-2 Structure of Thesis 4

    Chapter Two: Review of Research 5

    2-1 Introduction ..6

2-2 Active contour models 6

2-2-1 Energy function 7

2-2-2 Energy minimization 9

2-3 Active shape models 12

2-4 Flexible models 16

2-4-1 Lip model 16

2-4-2 Cost function formulation 17

2-4-3 Optimizing model parameters

2-7 Principal component analysis. 23

    2-7-1 Mathematical background of EM-PCA. 24

    2-7-2 Manifold production from input image. 24

    2-8 Discrete cosine transform. 26

    2-8-1 Modeling based on 3-D DCT. 26

            2-8-1-1 Lip movement feature extraction. 27

    2-8-1-2 Network-based movement feature extraction. 27

    2-8-1-3 Contour-based movement feature extraction. 28

    2-8-2 Feature extraction from the target area. 29

2-8-2-1 Visual feature extraction. 30

2-8-3 Cosine transform and LSDA. 31

2-8-3-1 Preprocessing. 31

2-8-3-2 DCT method, DCT + LDA. 32

Chapter Three: Recognition 39

3-1 Introduction 40

3-2 Detection of the lip area 41

3-2-1 Analysis of the lip and skin color combination 41

3-2-2 Hue, saturation and value (HSV) 42

3-2-3 Removing the red component 43

3-2-4 K-means algorithm 43

3-2-4-1 Algorithm implementation 44

3-2-5 Illumination intensity and binarization 45

3-2-6 Combined methods 45

3-3 Classification and identification methods 47

    3-3-1 neural network. 47

    3-3-1-1 feedforward networks. 48

    3-3-1-2 error back propagation algorithm. 48

3-3-2 Hidden Markov model. 51

4-1 Database

Removing the red color. 56

             4-3-3 Analysis of lip and skin color combination.

    4-4-2 Windowing 62

4-4-5 Calculating the Mel frequency coefficients 65

4-5 Finding the center of the lip and extracting an area around the lip 66

4-5-1 Zigzag scan 67

4-5-2 Feature reduction with LSDA

4-5-2-1 Using the log-sigmoid function and changing the training algorithm. 70

4-5-2-2 Using the tan-sigmoid function and the momentum algorithm. 70

       4-6 Extracting features from different images

            4-6-1 Extracting features from new images

    4-7 Reducing the number of frames and reducing the size of images. 73

4-7-3 Reducing the number of frames and reducing the size of images with the resize command

Table 1-1 Grouping of visemes in English. 3

Table 1-2 Grouping of visemes in the Persian language. 3

Table 4-1 Monosyllabic words in the database. 52

Table 4-2 Results before adjusting the endpoints. 71

Table 4-3 Results after adjusting the endpoints. 71

Table 4-4 Results of the features extracted from the original images with 20 frames. 74

Table 4-5 Results of the features extracted from the images normalized with relation (4-7), with 20 frames. 74

Table 4-6 Results of the features extracted from the reduced images with 20 frames. 75

Table 4-7 Results of the first 10 DCT coefficients of the original images with 20 frames. 75

Table 4-8 Results of the first 10 DCT coefficients of the normalized images with 20 frames. 76

Table 4-9 Results of the first 10 DCT coefficients of the reduced images with 20 frames. 76

Figure 2-3 Point distribution model; each mode is drawn at ±2σ around the mean. 14

    Figure 2-4 geometric model of the lip. 16

    Figure 2-5 lip pattern. 19

    Figure 2-6 Manifold production process. 25

    Figure 2-7 (a) Manifold interpolation result (b) Re-sampling of the interpolated manifold with 20 key points. 26

    Figure 2-8 Block diagram for network-based motion feature extraction. 28

    Figure 2-9 Contour-based motion feature extraction 29.

    Figure 2-10 The original image and four regions processed for feature extraction. 30

    Figure 2-11 (a) Points with similar color and shape are placed in a class. (b) An intraclass graph connects points with the same label. (c) An interclass graph connects points with different labels. (d) After applying LSDA, the distance between different classes has been maximized. 33

    Figure 2-12 The left side of the Bezier curve and the right side of the lip model. 36

Figure 2-13 The horizontal opening angle θ2 and the vertical opening angle θ1. 38

    Figure 3-1 The result of the analysis of skin and lip color combination and the lip corner points. 42

    Figure 3-2 Algorithm for separation of lip region

Figure 4-1 Thresholding with threshold 0.4. 55

Figure 4-2 Thresholding with threshold 0.5. 55

Figure 4-3 Using the red color removal algorithm with α = 0.5. 56

Figure 4-4 Pictures of the speakers. 57

Figure 4-5 The extracted lip shape after applying the algorithm. 58

Figure 4-6 The extracted lip shape after labeling. 59

    Figure 4-7 The rectangle surrounding the lip. 60

    Figure 4-8 steps of calculating the Mel coefficients. 61

Figure 4-9 Triangular filter bank. 63

Figure 4-10 The target area around the lip. 66

Figure 4-11 The 25 frames related to the word "bear" after finding the desired area. 67

Figure 4-12 Zigzag scanning of the matrix. 68

Figure 4-13 The results of the features + LSDA. 70

    Figure 4-14 The results of reduced images with a scale of 0.5 and the number of 25 frames. 77

    Figure 4-15 The results of reduced images with a scale of 0.7 and the number of 25 frames. 78

Figure 4-16 The results of different DCT coefficients with a scale of 0.5. 79

    Figure 4-17 The results of different DCT coefficients with a scale of 0.7. 80

    Source:

     

[1] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, Vol. 18(1), pp. 9–21, 2001.

[2] Sadeghi, Vahida Al-Sadat, "Vowel Recognition in Persian Monosyllabic and Bisyllabic Words," Master's Thesis, Semnan University, 1385.

[3] E. D. Petajan, "Automatic Lipreading to Enhance Speech Recognition," PhD thesis, University of Illinois at Urbana-Champaign, 1984.

[4] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active Contour Models," International Journal of Computer Vision, pp. 321-331, 1988.

    [5] C. Bregler and Y. Konig, "Eigenlips For Robust Speech Recognition," in Proc. IEEE Conf.
