Extraction of time-frequency feature for visual identification of Persian vowels

Number of pages: 107 File Format: word File Code: 30928
Year: 2013 University Degree: Master's degree Category: Electronic Engineering
  • Part of the Content
  • Contents & Resources
  • Summary of Extraction of time-frequency feature for visual identification of Persian vowels

Master's Thesis in Electrical Engineering, Electronics specialization

    Abstract

In this thesis, a method for identifying Persian vowels in monosyllabic words is presented. After separating the video frames, selecting the frames related to the pronunciation of the vowel in the monosyllabic word, and extracting the region around the lips, various features such as cosine (DCT) coefficients, wavelet coefficients, and MFCC coefficients were extracted to recognize the vowels. The features were then reduced with the LSDA dimension-reduction method to a feature vector of length 25, and the most effective features for recognition were determined. The database used in this research contains monosyllabic words spoken by different speakers and comprises 580 videos; 381 videos were used for training and 199 for testing. The extracted features were applied as input to a two-layer neural network with 20 neurons in the hidden layer and one neuron in the output layer. A tangent-sigmoid activation function was used in the hidden layer and a linear function at the output, and the network was trained with gradient descent with a variable learning rate. The best recognition rate, 95.75%, was obtained by computing MFCC coefficients from the first quarter (1/4) of the DCT coefficient vector produced by zigzag scanning of the cosine-coefficient matrix.
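As a point of reference, the classifier described above can be sketched as follows; this is a minimal illustration assuming the stated architecture (25 LSDA-reduced inputs, 20 tan-sigmoid hidden neurons, one linear output, gradient descent with a variable learning rate), and the data, labels, and hyperparameters below are placeholders rather than the author's original code.

```python
import numpy as np

# Sketch of the classifier described in the abstract (not the original code):
# 25 LSDA-reduced features -> 20 tan-sigmoid hidden neurons -> 1 linear output,
# trained with batch gradient descent and a variable (adaptive) learning rate.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 25, 20, 1

W1 = rng.normal(0, 0.1, (n_hidden, n_in)); b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(0, 0.1, (n_out, n_hidden)); b2 = np.zeros((n_out, 1))

def forward(X):
    """X: (n_in, n_samples). Returns hidden activations and linear outputs."""
    H = np.tanh(W1 @ X + b1)          # tan-sigmoid hidden layer
    Y = W2 @ H + b2                   # linear output layer
    return H, Y

def train(X, T, epochs=500, lr=0.01, lr_up=1.05, lr_down=0.7):
    """Gradient descent with a variable learning rate: the rate grows while the
    mean-squared error decreases and shrinks when it increases."""
    global W1, b1, W2, b2
    prev_err = np.inf
    for _ in range(epochs):
        H, Y = forward(X)
        E = Y - T                                  # output error
        err = float(np.mean(E ** 2))
        # Back-propagate through the linear output and the tanh hidden layer.
        dW2 = E @ H.T / X.shape[1]
        db2 = E.mean(axis=1, keepdims=True)
        dH = (W2.T @ E) * (1.0 - H ** 2)
        dW1 = dH @ X.T / X.shape[1]
        db1 = dH.mean(axis=1, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
        lr = lr * lr_up if err < prev_err else lr * lr_down
        prev_err = err

# Hypothetical usage with placeholder data (381 training vectors of 25 features,
# labels encoding the six Persian vowels; illustrative only).
X_train = rng.normal(size=(25, 381))
T_train = rng.integers(1, 7, size=(1, 381)).astype(float)
train(X_train, T_train)
```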

    Key words:

lip reading, vowel recognition, time-frequency features, feature dimension reduction, neural networks.

1-1 Introduction

The listener pays attention to the speaker's lip movements while the words are being spoken and pronounced. Probably all of us unconsciously use this non-acoustic aspect of speech to some extent, and when the listening environment is noisy we pay even more attention to the speaker's lip movements. This matters even more for people whose hearing is impaired. In addition, lip movements, or visual speech signals, can significantly improve the accuracy of acoustic speech recognition systems, especially in noisy environments. Synchronizing lip movements with the speech sound, removing the delay between sound and image, and automatic video dubbing are other applications of this field.

There are many people whose hearing system is damaged and who are unable to communicate with others through sound. Human speech naturally appears in two forms, audio and visual. Audio speech refers to the acoustic waveform produced by the speaker, and visual speech refers to the movements of the lips, the tongue, and the muscles of the face. In acoustic speech the basic unit is called the phoneme; in the visual domain the basic unit of mouth movements is called the viseme, the smallest visual component of speech. Many speech sounds are visually ambiguous; such sounds are grouped into one class that is represented by a single viseme. The mapping from phonemes to visemes is therefore many-to-one, that is, sets of phonemes can be found that have a similar effect on the shape of the mouth. The following tables give the grouping of visemes in English and in Persian [1], [2].

Table 1-1 Grouping of visemes in English (only the rows below could be recovered from the source)

Group   Phonemes     Group   Phonemes
10      th, dh       3       E
11      t, d         4       I
12      k, g         5       O
13      sh, zh       6       U
14      s, z         7       (not recovered)

Table 1-2 Grouping of visemes in the Persian language (entries as recovered from the source; repeated letters stand for distinct Persian characters with the same pronunciation, and "?" marks entries lost in extraction)

1. F, V                     5. R                              9. A
2. Th, Q, S, Z, Z, Z, Z     6. C, C, G, K, N, T, D, Y, T      10. ?
3. Zh, Sh                   7. E                              11. ?
4. B, P, M                  8. ?                              12. O
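The grouping shown in Tables 1-1 and 1-2 amounts to a many-to-one map from phonemes to viseme classes. A minimal Python illustration, using only the English groups recovered above (the dictionary and function names are hypothetical):

```python
# Many-to-one mapping from phonemes to viseme classes (illustrative only; the
# groups are the English entries recovered in Table 1-1, not a full inventory).
PHONEME_TO_VISEME = {
    "th": 10, "dh": 10,
    "t": 11, "d": 11,
    "k": 12, "g": 12,
    "sh": 13, "zh": 13,
    "s": 14, "z": 14,
}

def viseme_of(phoneme: str):
    """Return the viseme class of a phoneme, or None if it is not listed."""
    return PHONEME_TO_VISEME.get(phoneme)

# Visually ambiguous sounds fall into the same class:
assert viseme_of("t") == viseme_of("d") == 11
```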

In general, there are three approaches to speech recognition: acoustic speech recognition, visual speech recognition, and audio-visual speech recognition; the visual approach is the one investigated in this research. The first chapter gave an introduction to speech recognition. The second chapter reviews the research carried out in the field of visual speech recognition and the different methods used for it. The third chapter introduces different methods for separating the mouth region from the rest of the face, so that, in addition to reducing the size of the images, the complexity and large dimensionality of the features can be avoided. The fourth chapter describes how the time-frequency features of the selected mouth region are computed and extracted from the video frames, examines their performance when the number of selected frames and the size of the images are changed and when one of the feature-reduction methods is applied, explains how the extracted features are applied to the neural network for recognition, and introduces the database used in this research.
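As a rough sketch of the per-frame feature computation summarized above and in the abstract (2-D cosine transform of the mouth region, zigzag scanning of the coefficient matrix, and keeping the first quarter of the resulting vector), the following Python fragment shows one possible reading; the helper names `zigzag` and `frame_features`, the 32x32 crop size, and the choice of `scipy.fft.dctn` are assumptions for illustration, not the thesis implementation.

```python
import numpy as np
from scipy.fft import dctn   # 2-D type-II DCT (available in SciPy >= 1.4)

def zigzag(block: np.ndarray) -> np.ndarray:
    """Read a 2-D matrix in zigzag (alternating anti-diagonal) order into a 1-D vector."""
    h, w = block.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda ij: (ij[0] + ij[1],
                                   ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))
    return np.array([block[i, j] for i, j in order])

def frame_features(mouth_roi: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """DCT of the mouth region, zigzag scan, keep the leading fraction of coefficients."""
    coeffs = dctn(mouth_roi.astype(float), norm="ortho")
    vec = zigzag(coeffs)
    return vec[: int(len(vec) * keep_fraction)]

# Hypothetical use on one 32x32 grayscale mouth crop:
roi = np.random.rand(32, 32)
print(frame_features(roi).shape)   # -> (256,), i.e. the first quarter of 1024 coefficients
```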


    2-1 Introduction

Visual speech recognition, or in other words lip reading, consists of two parts: first extracting features from the lip images, and then classifying those features. Two kinds of methods can be used to extract image features: image-based and model-based. In image-based methods, the features are extracted directly by applying mathematical transforms such as the Fourier transform, the wavelet transform, the discrete cosine transform, principal component analysis, or linear discriminant analysis to the images. The problem with these methods is the large size and redundancy of the data and the sensitivity to rotation and displacement of the lips. In model-based methods, a model of the lip is created and described by a small set of parameters, as in active shape models, active contour models, and flexible models. The advantage of this approach is that the features are expressed in small dimensions and the model is unaffected by image brightness, rotation, size, and displacement of the lips. Petajan was probably the first researcher to develop a lip-reading system [3].

2-2 Active contour models

The active contour model is an open or closed curve with a number of control points placed near the image of the object whose shape we want to extract. Several energy terms are defined for its deformation, and by minimizing these energies the curve takes the required shape. This model was introduced by Kass and his colleagues [4], who named it the snake because the movement of the contour resembles a snake crawling. A snake can be described by a number of points together with an internal elastic energy and an energy based on the outer edges of the image.

2-2-1 Energy function

A snake can be represented by n points v_i = (x_i, y_i), i = 0, 1, 2, ..., n-1. The energy function of the snake is expressed as follows:

E^*_{snake} = \int_0^1 E_{snake}(v(s))\,ds = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \right] ds        (2-1)

E_{external} = E_{image} + E_{con}        (2-2)

E_{internal} = E_{cont} + E_{curv}        (2-3)

The external energy is the sum of the image energy and the energy of the external constraint applied by the user. The internal energy is the sum of the snake's continuity (elastic) energy and its bending (curvature) energy:

E_{int} = \frac{1}{2}\left( \alpha(s)\,\lVert v_s(s)\rVert^2 + \beta(s)\,\lVert v_{ss}(s)\rVert^2 \right)

Large values of \alpha(s) and \beta(s) increase the internal energy of the snake when it stretches or bends too much, while small values place fewer restrictions on the size and shape of the snake.
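To make relations (2-1) to (2-3) concrete, the sketch below evaluates a discretized internal energy and a gradient-based image energy for a polygonal snake; it is a minimal illustration of the standard snake formulation with assumed parameter values (alpha, beta), not the implementation of the works cited above.

```python
import numpy as np

def internal_energy(pts: np.ndarray, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Discrete internal energy of a closed snake (continuity + bending terms).
    pts is an (n, 2) array of contour points v_i = (x_i, y_i)."""
    d1 = np.roll(pts, -1, axis=0) - pts                                 # ~ v_s  (first difference)
    d2 = np.roll(pts, -1, axis=0) - 2 * pts + np.roll(pts, 1, axis=0)   # ~ v_ss (second difference)
    e_cont = alpha * np.sum(d1 ** 2)
    e_curv = beta * np.sum(d2 ** 2)
    return 0.5 * (e_cont + e_curv)

def image_energy(image: np.ndarray, pts: np.ndarray) -> float:
    """Image energy: negative gradient magnitude sampled at the contour points,
    so that minimizing the total energy pulls the snake toward strong edges."""
    gy, gx = np.gradient(image.astype(float))
    grad_mag = np.hypot(gx, gy)
    rows = np.clip(np.rint(pts[:, 1]).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.rint(pts[:, 0]).astype(int), 0, image.shape[1] - 1)
    return -float(grad_mag[rows, cols].sum())

def snake_energy(image: np.ndarray, pts: np.ndarray) -> float:
    """Total energy: internal + image energy (the user-constraint term E_con is omitted here)."""
    return internal_energy(pts) + image_energy(image, pts)

# Hypothetical usage: a circular snake of 30 points on a random 64x64 image.
img = np.random.rand(64, 64)
t = np.linspace(0, 2 * np.pi, 30, endpoint=False)
contour = np.stack([32 + 10 * np.cos(t), 32 + 10 * np.sin(t)], axis=1)
print(snake_energy(img, contour))
```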

  • Contents & References of Extraction of time-frequency feature for visual identification of Persian vowels

    List:

    Chapter One: Introduction 1

    1-1 Introduction ..2

    1-2 Structure of Thesis 4

    Chapter Two: Review of Research 5

    2-1 Introduction ..6

2-2 Active contour models 6

2-2-1 Energy function 7

2-2-2 Energy minimization 9

2-3 Active shape models 12

2-4 Flexible models 16

2-4-1 Lip model 16

2-4-2 Cost function formulation 17

2-4-3 Optimizing model parameters

2-7 Principal component analysis. 23

    2-7-1 Mathematical background of EM-PCA. 24

    2-7-2 Manifold production from input image. 24

    2-8 Discrete cosine transform. 26

    2-8-1 Modeling based on 3-D DCT. 26

            2-8-1-1 Lip movement feature extraction. 27

    2-8-1-2 Network-based movement feature extraction. 27

    2-8-1-3 Contour-based movement feature extraction. 28

    2-8-2 Feature extraction from the target area. 29

2-8-2-1 Visual feature extraction. 30

2-8-3 Cosine transform and LSDA. 31

2-8-3-1 Preprocessing. 31

2-8-3-2 DCT method, DCT + LDA. 32

Chapter Three: Recognition 39

3-1 Introduction 40

3-2 Detection of the lip area 41

3-2-1 Analysis of the lip and skin color combination 41

3-2-2 Hue, saturation and value (HSV) 42

3-2-3 Removing the red component 43

3-2-4 K-means algorithm 43

3-2-4-1 Algorithm implementation 44

3-2-5 Illumination intensity and binarization 45

3-2-6 Combined methods 45

3-3 Classification and identification methods 47

    3-3-1 neural network. 47

    3-3-1-1 feedforward networks. 48

    3-3-1-2 error back propagation algorithm. 48

3-3-2 Hidden Markov model. 51

4-1 Database

Removing the red color. 56

             4-3-3 Analysis of lip and skin color combination.

    4-4-2 Windowing 62

4-4-5 Calculating the Mel frequency coefficients 65

4-5 Finding the center of the lip and extracting an area around the lip 66

4-5-1 Zigzag scan 67

4-5-2 Feature reduction with LSDA

4-5-2-1 Using the log-sigmoid function and changing the training algorithm. 70

4-5-2-2 Using the tan-sigmoid function and the momentum algorithm. 70

       4-6 Extracting features from different images

            4-6-1 Extracting features from new images

    4-7 Reducing the number of frames and reducing the size of images. 73

4-7-3 Reducing the number of frames and reducing the size of images with the resize command

Table 1-1 Grouping of visemes in English. 3

Table 1-2 Grouping of visemes in the Persian language. 3

Table 4-1 Monosyllabic words in the database. 52

Table 4-2 Results before adjusting the endpoints. 71

Table 4-3 Results after adjusting the endpoints. 71

Table 4-4 Results of the features extracted from the original images with 20 frames. 74

Table 4-5 Results of the features extracted from the images normalized with relation (4-7), with 20 frames. 74

Table 4-6 Results of the features extracted from the reduced images with 20 frames. 75

Table 4-7 Results of the first 10 DCT coefficients of the original images with 20 frames. 75

Table 4-8 Results of the first 10 DCT coefficients of the normalized images with 20 frames. 76

Table 4-9 Results of the first 10 DCT coefficients of the reduced images with 20 frames. 76

Figure 2-3 Point distribution model; each mode is drawn at ±2σ around the mean. 14

    Figure 2-4 geometric model of the lip. 16

    Figure 2-5 lip pattern. 19

    Figure 2-6 Manifold production process. 25

    Figure 2-7 (a) Manifold interpolation result (b) Re-sampling of the interpolated manifold with 20 key points. 26

    Figure 2-8 Block diagram for network-based motion feature extraction. 28

    Figure 2-9 Contour-based motion feature extraction 29.

    Figure 2-10 The original image and four regions processed for feature extraction. 30

    Figure 2-11 (a) Points with similar color and shape are placed in a class. (b) An intraclass graph connects points with the same label. (c) An interclass graph connects points with different labels. (d) After applying LSDA, the distance between different classes has been maximized. 33

    Figure 2-12 The left side of the Bezier curve and the right side of the lip model. 36

Figure 2-13 The horizontal opening angle θ2 and the vertical opening angle θ1. 38

    Figure 3-1 The result of the analysis of skin and lip color combination and the lip corner points. 42

    Figure 3-2 Algorithm for separation of lip region

Figure 4-1 Thresholding with threshold 0.4. 55

Figure 4-2 Thresholding with threshold 0.5. 55

Figure 4-3 Using the red color removal algorithm with α = 0.5. 56

Figure 4-4 Pictures of the speakers. 57

Figure 4-5 The extracted lip shape after applying the algorithm. 58

Figure 4-6 The extracted lip shape after labeling. 59

    Figure 4-7 The rectangle surrounding the lip. 60

    Figure 4-8 steps of calculating the Mel coefficients. 61

Figure 4-9 Triangular filter bank. 63

Figure 4-10 The target area around the lip. 66

Figure 4-11 The 25 frames related to the word "bear" after finding the desired area. 67

Figure 4-12 Zigzag scanning of the matrix. 68

Figure 4-13 The results of the features + LSDA. 70

    Figure 4-14 The results of reduced images with a scale of 0.5 and the number of 25 frames. 77

    Figure 4-15 The results of reduced images with a scale of 0.7 and the number of 25 frames. 78

Figure 4-16 The results of different DCT coefficients with a scale of 0.5. 79

    Figure 4-17 The results of different DCT coefficients with a scale of 0.7. 80

    Source:

     

[1] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, Vol. 18(1), pp. 9–21, 2001.

[2] Sadeghi, Vahida Al-Sadat, "Vowel Recognition in Persian Monosyllabic and Bisyllabic Words," Master's Thesis, Semnan University, 1385.

[3] E. D. Petajan, "Automatic Lipreading to Enhance Speech Recognition," PhD thesis, University of Illinois at Urbana-Champaign, 1984.

[4] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active Contour Models," International Journal of Computer Vision, pp. 321-331, 1988.

    [5] C. Bregler and Y. Konig, "Eigenlips For Robust Speech Recognition," in Proc. IEEE Conf.
