Extracting time-frequency features for visual identification of Persian vowels

Number of pages: 102
File format: Word
File code: 32129
Year: 2013
Degree: Master's
Category: Electronic Engineering
  • Summary of Extracting time-frequency features for visual identification of Persian vowels

    Master's Thesis in Electrical Engineering, Electronics Specialization

    Abstract

    In this thesis, a method for identifying Persian vowels in monosyllabic words is presented. After separating the video frames, selecting the frames corresponding to the pronunciation of the vowel in each monosyllabic word, and extracting the region around the lips, various features such as discrete cosine transform (DCT) coefficients, wavelet coefficients, and MFCC coefficients were extracted to recognize the vowels. The features were then reduced with the LSDA dimensionality reduction method, shrinking the feature vector to 25 elements, and the most effective features for recognition were determined. The database used in this research contains 580 videos of monosyllabic words spoken by different speakers; 381 videos were used for training and 199 for testing. The extracted features were applied as input to a two-layer neural network with 20 neurons in the hidden layer and one neuron in the output layer, using a tangent sigmoid activation function in the hidden layer and a linear function at the output; the network was trained with gradient descent with a variable learning rate. The best recognition rate, 95.75%, was obtained by computing MFCC coefficients from the first quarter of the DCT coefficient vector produced by zigzag scanning of the cosine coefficient matrix.
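    As a rough, illustrative sketch of the classifier described above (written in Python with NumPy, which here stands in for whatever environment the thesis actually used), the snippet below wires up a 25-input network with 20 tangent sigmoid hidden neurons, one linear output neuron, and plain gradient descent with a variable learning rate. The synthetic data, the mean squared error loss, and the exact rate adaptation rule are assumptions for illustration, not the thesis's implementation.

```python
# Minimal sketch of the 25 -> 20 -> 1 network from the abstract.
# Hidden layer: tanh (tangent sigmoid); output: linear.
# Loss, data, and learning-rate rule are placeholders (assumptions).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((381, 25))                    # 381 training samples, 25 features
y = rng.integers(0, 2, size=(381, 1)).astype(float)   # placeholder targets

W1 = rng.standard_normal((25, 20)) * 0.1; b1 = np.zeros((1, 20))
W2 = rng.standard_normal((20, 1)) * 0.1;  b2 = np.zeros((1, 1))
lr, prev_loss = 0.01, np.inf

for epoch in range(200):
    # Forward pass: tanh hidden layer, linear output.
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    loss = np.mean((out - y) ** 2)

    # Variable learning rate: grow on improvement, shrink otherwise
    # (a common heuristic; the thesis's exact rule is not given here).
    lr = lr * 1.05 if loss < prev_loss else lr * 0.7
    prev_loss = loss

    # Backward pass for mean squared error.
    d_out = 2 * (out - y) / len(X)
    dW2 = h.T @ d_out; db2 = d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * (1 - h ** 2)               # tanh derivative
    dW1 = X.T @ d_h;  db1 = d_h.sum(axis=0, keepdims=True)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

    With real lip-region feature vectors in place of the random data, this is the shape of the training loop the abstract describes.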

    Keywords:

    lip reading, vowel recognition, time-frequency features, feature dimensionality reduction, neural networks

    1-1 Introduction

    Humans have long known that speech can be better understood by paying attention to the movements of the speaker's lips and mouth while words are being pronounced. Most of us probably use this non-acoustic aspect of speech unconsciously to some extent, and when the listening environment is noisy, we pay more attention to the speaker's lip movements. This is even more important for people with hearing impairments. In addition, lip movements, or visual speech signals, can significantly improve the accuracy of audio speech recognition systems, especially in noisy environments. Synchronizing lip movements with the speech sound, eliminating the delay between sound and image, and automatic video dubbing are other applications of this field.

    There are many people whose auditory system is damaged and who are therefore unable to communicate properly with others. Human speech occurs in nature in both audio and visual form: audio speech refers to the waveform produced by the speaker, and visual speech refers to the movements of the lips, tongue, and facial muscles. In audio speech, the basic unit is called the phoneme [1]. In the visual domain, the basic unit of mouth movements is called the viseme [2], the smallest visual component of speech. Many speech sounds are visually ambiguous; such sounds are grouped into a single class represented by one viseme. The mapping between phonemes and visemes is therefore many-to-one: sets of phonemes that have a similar effect on the shape of the mouth map to the same viseme. The following tables give the grouping of visemes in English and Persian.

    (The tables are available in the main file)
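    To make the many-to-one phoneme-to-viseme idea concrete, here is a small hypothetical Python mapping. The groups below (e.g., the bilabials /p/, /b/, /m/ collapsing into one viseme) are a common illustrative example, not the thesis's actual English or Persian tables, which appear in the main file.

```python
# Hypothetical many-to-one phoneme-to-viseme grouping (illustrative only).
# Phonemes that look alike on the lips collapse into a single viseme class.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
}

def viseme_of(phoneme: str) -> str:
    """Map a phoneme to its viseme class (many-to-one)."""
    return PHONEME_TO_VISEME.get(phoneme, "V_unknown")

# /p/, /b/, and /m/ are visually ambiguous: all three map to one viseme.
assert viseme_of("p") == viseme_of("b") == viseme_of("m")
```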

    Generally, there are three approaches to speech recognition: audio speech recognition, visual speech recognition, and audio-visual speech recognition. This research is concerned with visual speech recognition.

    1-2 Structure of the thesis

    This thesis examines visual speech recognition methods across its chapters. The first chapter gives an introduction to speech recognition. The second chapter reviews the research conducted in the field of visual speech recognition and the different methods used for it. The third chapter introduces different methods for separating the mouth from the rest of the face; using these methods, in addition to reducing the size of the images, we avoid excessive complexity and dimensionality in the features. The fourth chapter describes the method of calculating and extracting time-frequency features of the mouth region from different video frames, and examines their performance as the number of selected frames and the size of the images are varied, together with one of the feature reduction methods; the extracted features are applied to a neural network for recognition, and the database used in this research is also introduced. In this thesis, time-frequency features are extracted from images of the speaker's mouth and used as input parameters to a neural network for recognition. Because we worked with video, we had to handle a varying number of frames per clip: the frames were first separated manually, then the region around the mouth was selected and the desired features were computed for that region in each frame. To improve performance and reduce the dimensionality of the features, we used the LSDA dimensionality reduction technique, which reduced the size of our feature vector. The database consists of different individuals, each of whom uttered the monosyllabic words two or three times. Finally, a vowel recognition rate of 95.75% was achieved.
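    The following sketch illustrates two steps of the pipeline described above: taking the 2-D DCT of a (placeholder) mouth-region image, reading the coefficients in zigzag order, and keeping the first quarter of the resulting vector. Since LSDA is not available in common Python libraries, PCA stands in at the end purely to show the reduction to 25 dimensions; the thesis itself uses LSDA, and the random images and clip count are placeholders.

```python
# Sketch of DCT + zigzag feature extraction, with PCA as a stand-in
# for LSDA (assumption: the thesis's LSDA step is not reproduced here).
import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def zigzag(mat):
    """Read a 2-D matrix in JPEG-style zigzag order (low frequencies first)."""
    h, w = mat.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([mat[i, j] for i, j in order])

rng = np.random.default_rng(0)
clips = rng.random((30, 64, 64))                # placeholder: 30 mouth-region images

features = []
for img in clips:
    # Separable 2-D DCT over rows and columns.
    coeffs = dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")
    vec = zigzag(coeffs)
    features.append(vec[: len(vec) // 4])       # keep first 1/4 of the zigzag vector

# Stand-in for LSDA: reduce each clip's feature vector to 25 dimensions.
reduced = PCA(n_components=25).fit_transform(np.array(features))
print(reduced.shape)                            # (30, 25)
```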

  • Contents & References of Extracting time-frequency features for visual identification of Persian vowels

    List:

    The first chapter: Introduction 1
    1-1 Introduction 2
    1-2 Thesis structure 4
    The second chapter: An overview of the conducted research 5
    2-1 Introduction 6
    2-2 Active contour models 6
    2-2-1 Energy function 7
    2-2-2 Energy minimization 9
    2-3 Active shape models 12
    2-4 Flexible models
    2-4-1 Lip model 21
    2-6-2 Color transformation
    2-8 Discrete cosine transform 26
    2-8-1 Modeling based on 3-D DCT 26
    2-8-1-1 Extraction of lip motion features 28
    2-8-2 Feature extraction from the target area 29
    2-8-2-1 Extraction of visual features 30
    2-8-3 Cosine transform and LSDA 31
    2-8-3-1 Preprocessing
    2-8-3-2 DCT method 31
    2-8-3-3 DCT + PCA 31
    2-8-3-4 DCT + LDA 32
    2-9 Bezier curve 35
    2-10 Separation of the lip area with the Cummins algorithm 37
    The third chapter: Mouth area extraction methods and detection systems 39
    3-1 Introduction 40
    3-2 Lip area detection 41
    3-2-1 Analysis of lip and skin color composition 41
    3-2-2 Hue, saturation, and light intensity (HSV) 42
    3-2-3 Removing the red component 43
    3-2-4 Cummins algorithm 43
    3-2-4-1 Algorithm implementation 44
    3-2-5 Brightness and binarization 45
    3-2-6 Combined methods 45
    3-3 Classification and identification methods 47
    3-3-1 Neural networks 47
    3-3-1-1 Feedforward networks 48
    3-3-1-2 Error backpropagation algorithm 48
    3-3-2 Hidden Markov model
    The fourth chapter: Feature extraction, implementation of the proposed method, and introduction of the database 51

    4-1 Database 52
    4-1-1 Separation of recorded videos 53
    4-2 Extracted features 54
    4-4-1 Framing 61
    4-4-2 Windowing 62
    4-4-3 Discrete Fourier transform 62
    4-4-4 Mel scale 62
    4-4-5 Discrete cosine transform 64
    4-4-5-1 Calculating cosine and wavelet coefficients 65
    4-4-5-2 Calculating Mel-frequency coefficients 65
    4-5 Finding the center of the lip and extracting an area around the lip 66
    4-5-1 Zigzag scan 67
    4-5-2-1 Using the log-sigmoid function and changing the training algorithm 70
    4-5-2-2 Using the tan-sigmoid function and the momentum algorithm
    4-6 Extracting features from different images
    4-6-1 Extracting features from new images
    4-6-2 Mel-frequency coefficients and cosine coefficients 72
    4-7 Reducing the number of frames and reducing the size of images 73
    4-7-1 Calculating MFCC coefficients 73
    4-7-3 Reducing the number of frames and reducing the size of images with the resize command

    References:

    [1] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, 2001.

    [2] Sadeghi, Vahida Al-Sadat, "Vowel Recognition in Persian Monosyllabic and Bisyllabic Words," Master's Thesis, Semnan University, 1385.

    [3] E. D. Petajan, "Automatic Lipreading to Enhance Speech Recognition," PhD thesis, University of Illinois at Urbana-Champaign, 1984.

    [4] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active Contour Models," International Journal of Computer Vision, pp. 321-331, 1988.

    [5] C. Bregler and Y. Konig, "Eigenlips for Robust Speech Recognition," in Proc. IEEE Conf. Acoustics, Speech and Signal Processing, pp. 669-672, 1994.

    [6] Takeshi Saitoh and Ryosuke Konishi, "Word Recognition Based on Two-Dimensional Lip Motion Trajectory," International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2006), pp. 287-290, 12-15 Dec. 2006.

    [7] Mir Hadi Seyedarabi, Ali Aghagolzadeh, and Sohrab Khanmohammadi, "Automatic tracking of lip movements and their feature points using active contours," 14th Iranian Conference on Electrical Engineering (ICEE), 2006.

    [8] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active Shape Models - Their Training and Application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan. 1995.

    [9] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, "Extraction of visual features for lipreading," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 198-213, Feb. 2002.

    S. H. Leung, et al., "A real-time automatic lipreading system," International Symposium on Circuits and Systems, no. 2, pp. 101-104, IEEE, Vancouver, Canada, May 2004.

    [12] D. Thambiratnam, T. Wark, S. Sridharan, and V. Chandran, "Speech Recognition in Adverse Environments Using Lip Information," IEEE TENCON 1997: Speech and Image Technologies for Computing and Telecommunications, vol. 1, pp. 149-152, 4 Dec. 1997.

    [13] Tanveer A. Faruquie, Abhik Majumdar, Nitendra Rajput, and L. V. Subramaniam, "Large Vocabulary Audio-Visual Speech Recognition Using Active Shape Models," 15th International Conference on Pattern Recognition, vol. 3, pp. 106-109, 2000.

    [14] A. W. C. Liew, et al., "Lip contour extraction from color images using a deformable model," Pattern Recognition, vol. 35, pp. 2949-2962, 2002.

    [15] Stefan Horbelt and Jean-Luc Dugelay, "Active Contours for Lipreading: Combining Snakes with Templates," 15th GRETSI Symposium on Signal and Image Processing, pp. 18-22, September 1995, France.

    [16] Mohammad Mehdi Hosseini, Abdorreza Alavi Gharahbagh, and Sedigheh Ghofrani, "Vowel Recognition by Using the Combination of Haar Wavelet and Neural Network," KES'10: Proceedings of the 14th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, Part I, pp. 331-339, 2010.

    [17] M. M. Hosseini and S. Ghofrani, "Automatic Lip Extraction Based on Wavelet Transform," IEEE GCIS, pp. 393-396, 2009, China.

    [18] Dahai Yu, Ovidiu Ghita, Alistair Sutherland, and Paul F. Whelan, "A PCA-Based Manifold Representation for Visual Speech Recognition," CIICT 2007: Proceedings of the China-Ireland International Conference on Information and Communication Technologies, 28-29 August 2007, Dublin, Ireland.

    [19] Y. L. Tian and T. Kanade, "Robust Lip Tracking by Combining Shape, Color and Motion," Proc. of the Asian Conference on Computer Vision, pp. 1040-1045, 2000.
