Speaker recognition in multi-speaker environment using support vector machine

Number of pages: 117 | File format: Word | File code: 31352
Year: 2011 | Degree: Master's | Category: Electronic Engineering
  • Summary of Speaker recognition in multi-speaker environment using support vector machine

    Electronics group

    Master's thesis

    Abstract:

    Speaker identification is one of the central topics in speech processing: determining from the speech signal who is speaking, and when. The goal is to design a system that detects speaker changes and labels each speaker's speech, that is, specifies which speaker spoke in which intervals. Today this task, which combines the processes of segmentation and labeling, is commonly known as speaker diarization. The purpose of segmentation is to divide the speech signal into parts that contain the speech of only one speaker, and the purpose of clustering is to identify all the speech parts belonging to one speaker and assign them a single label. The purpose of this thesis is to design and implement a speaker segmentation and clustering system using recent algorithms, and to improve the results of these algorithms for this problem. The system must correctly recognize speaker change points without any prior information about the speakers, and finally place all the audio parts belonging to one speaker in a single cluster.

    In this dissertation, the speaker recognition system consists of three main stages. In the first stage, the non-speech parts of the audio file are removed, in order to increase the accuracy and speed of the following stages. The speech file is then divided into homogeneous segments, each containing the speech of only one speaker. In the third stage, a suitable clustering places the segments from the previous stage that belong to the same speaker into one cluster. To implement the system, four types of feature vectors (MFCC, root-MFCC, TDC, and root-TDC) and three databases were used; the accuracy of the segmentation stage reached 80%, and the accuracy of the clustering stage reached 59% using the support vector machine.
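    The thesis reports the clustering-stage accuracy with a support vector machine but gives no implementation details in this summary. The following is a minimal sketch of the idea, assuming scikit-learn's SVC and synthetic stand-ins for MFCC frame vectors; the data, dimensions, and RBF kernel are illustrative assumptions, not the thesis's actual setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dimensional MFCC frame vectors from two
# speakers (the thesis's real features are MFCC/root-MFCC/TDC/root-TDC).
speaker_a = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
speaker_b = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
frames = np.vstack([speaker_a, speaker_b])
labels = np.array([0] * 200 + [1] * 200)

# Train an SVM on half of the frames, then label the rest; a segment's
# speaker can then be taken as the majority vote over its frame labels.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(frames[::2], labels[::2])
predicted = clf.predict(frames[1::2])
accuracy = (predicted == labels[1::2]).mean()
print(f"frame-level accuracy: {accuracy:.2f}")
```

    On such well-separated synthetic data the SVM separates the two speakers almost perfectly; real MFCC vectors overlap far more, which is consistent with the 59% clustering accuracy the thesis reports.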

    Keywords:

    Statistical speaker segmentation

    Speaker segmentation

    Speech segment recognition

    Speaker clustering

    Introduction

    Today, multimedia data makes up a significant part of human knowledge, and the volume of multimedia files archived by various institutions has grown considerably in recent years. Making these files accessible and searchable can greatly help people looking for information, and searching and retrieving information at this scale requires a computer system. As a result, the structuring of multimedia files has recently become an active research area. Among such data, audio information is especially important, because most archives contain audio from TV and radio reports as well as telephone conversations. Extensive research has begun in this field in recent years and has produced acceptable results. Other applications include identifying a culprit and extracting the key statements of a witness or defendant in court.

    In audio applications, the main information in the files is the conversation of a number of speakers, and the purpose of the final system is to answer the question of who spoke at what times. Different parts of this research field go by different names, such as speaker segmentation [1], speaker recognition [2], robust transcription [3], and speaker indexing [4]. Such systems allow easy navigation of long audio files that belong to several speakers (such as news broadcasts or company meetings). Long radio conversations and meetings are environments in which several speakers are present and talk to each other. The ultimate goal of such systems is to partition audio files into the regions in which a particular speaker spoke, providing easy access to the parts of a given speaker's speech. The larger the volume of audio data, the more important these systems become.

    As the number of text documents available on the Internet grew, techniques such as text indexing became necessary to make these documents easy to access and search. A similar need arose as the number of audio documents, such as lectures, interviews, and meetings, increased. Accessing audio documents is clearly much harder than accessing text: listening to a recorded audio file is more time-consuming than reading a text, and manually indexing audio documents is difficult compared with text indexing. The proposed solution to this problem is the automatic indexing of audio documents [5]. In 2001, Pelkan and Sidharun and their group improved system results by reducing the effect of noise on the signal, leading to better speaker separation. In 2005, Boulian and Kenny obtained different results by using other feature vectors (or combining earlier methods) and by using Gaussian models in the system. Also in 2005, Yamashita and Matsunaga improved the speaker segmentation results by using audio signal features such as pitch frequency, energy, the signal's maximum frequencies, and three other features [1]. In the following years, different methods applied to the system's different parts have continued to refine these systems and improve their results.

    The aim of this thesis is to design and implement a system that can detect speaker changes in an audio file containing the speech of several speakers and, as far as possible, group each speaker's speech without prior information about the speakers. The system comprises two basic parts: speaker segmentation and speaker clustering. The task of segmentation [6] is to divide the speech signal into segments that contain the speech of only one speaker. In the clustering stage [7], the segments belonging to the same speaker are identified and assigned a single label. This work is useful in many speech applications involving recognition or indexing [8] in environments where several speakers may talk, such as meetings, conferences, and news broadcasts. It can not only help advanced speech recognition systems improve group recognition results, but also assist them in identifying and transcribing conversations. As mentioned before, it can also be used in audio indexing, which makes it possible to search audio files. Figure (1-1) shows how this system works.

    Figure (1-1): Display of speaker segmentation on input speech
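    The segmentation stage sketched above must locate speaker change points without prior speaker models. A standard way to do this, in the family of BIC-based methods the thesis covers in Chapter 3, is to compare two adjacent analysis windows with the Bayesian information criterion. The sketch below is a generic delta-BIC test on synthetic feature vectors; the penalty weight lambda, the window sizes, and the feature dimension are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC for the hypothesis that windows x and y come from two
    different full-covariance Gaussians rather than one; a positive
    value suggests a speaker change between the windows."""
    z = np.vstack([x, y])
    n, d = z.shape

    def half_n_logdet(w):
        # 0.5 * len(w) * log |cov(w)|, the Gaussian log-likelihood term
        cov = np.cov(w, rowvar=False)
        return 0.5 * len(w) * np.linalg.slogdet(cov)[1]

    # Model-complexity penalty: d mean parameters + d(d+1)/2 covariance
    # parameters for the extra Gaussian, weighted by lambda.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return half_n_logdet(z) - half_n_logdet(x) - half_n_logdet(y) - penalty

rng = np.random.default_rng(1)
same = rng.normal(0.0, 1.0, size=(400, 2))
other = rng.normal(4.0, 1.0, size=(200, 2))

# Homogeneous windows: delta-BIC stays negative (no change point).
print(delta_bic(same[:200], same[200:]))
# Windows drawn from different "speakers": delta-BIC turns positive.
print(delta_bic(same[:200], other))
```

    In a full system this test slides along the file, and each local maximum of delta-BIC above zero is declared a speaker change point.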

    The audio file under review is a single-channel recording that contains several audio sources. These sources can include several speakers, music, various kinds of noise, and so on; their type and details depend on the functional characteristics of the file. Speaker segmentation and clustering systems are generally used in three areas: broadcast news, recorded meetings, and telephone conversations. As mentioned earlier, these areas differ in recording quality (bandwidth, microphones, and noise), the amount and type of non-speech sources, the number of speakers, and the style and structure of the speech (speech duration, speaker order), and each area poses its own problems for speaker segmentation and clustering. Speaker recognition systems nevertheless try to achieve acceptable results in all three areas [1].

    At the lowest level, such a system classifies the audio data into clusters, each of which belongs to one speaker. Two views are possible here: supervised [9] and unsupervised [10]. In the supervised view, information about who speaks in the audio file is already available. In the unsupervised view, the system must partition the file into time intervals in each of which only one speaker, whose identity is unknown, speaks. Note that the output of an unsupervised classification system can be used as the input to identification systems [11], yielding a supervised classification system whose efficiency and execution time are better. On the other hand, the performance of these systems also depends on how much prior information is allowed: speech samples from the speakers, the number of speakers in the audio file, or information about the structure of the recording. In most speaker segmentation and clustering systems, however, it is assumed that there is no prior information about the speakers or their number.
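    The unsupervised view described above is usually realized with bottom-up (agglomerative) clustering, which the thesis discusses in Chapter 4. The sketch below uses scikit-learn on synthetic per-segment features, and it assumes the number of speakers is known; real diarization systems must estimate that number (for example, with a BIC stopping criterion), so this is an illustrative simplification.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)

# Synthetic per-segment feature vectors (e.g. averaged MFCCs) for ten
# segments from two speakers; the labels are hidden from the system.
segments = np.vstack([
    rng.normal(0.0, 0.5, size=(5, 13)),   # segments of speaker A
    rng.normal(4.0, 0.5, size=(5, 13)),   # segments of speaker B
])

# Bottom-up clustering repeatedly merges the closest segments until the
# requested number of clusters (here, speakers) remains.
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(segments)
print(clusters)
```

    Each resulting cluster label plays the role of a speaker tag, so all segments with the same label are attributed to one (anonymous) speaker.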

  • Contents & References of Speaker recognition in multi-speaker environment using support vector machine

    Table of Contents

    Chapter One: Introduction to speaker recognition systems

    1-1- Introduction

    1-2- The different working stages of speaker recognition systems

    1-2-1- Acoustic segmentation

    1-2-2- Recognizing speech from non-speech

    1-2-3- Speaker gender detection

    1-2-4- Speaker change detection

    1-3- Speaker segmentation and clustering methods

    1-3-2- Model-based methods

    1-3-3- Hybrid or combined methods

    1-4- Clustering

    1-5- Summary

    Chapter Two: Recognizing speech from non-speech regions

    2-1- Introduction

    2-2- Structure of speech/non-speech recognition

    2-2-1- Preprocessing

    2-2-2- Feature extraction

    2-2-2-1- Energy

    2-2-2-2- Zero crossing rate

    2-2-2-3- Feature extraction using mel-scale cepstral frequency coefficients

    2-2-2-4- LPC coefficients

    2-2-2-5- Entropy

    2-2-2-6- Intermittent size

    2-2-2-7- … Band

    2-2-2-8- Other parameters

    2-2-3- Threshold calculation

    2-2-4- VAD decision making

    2-2-4-1- Decision making based on hidden Markov models

    2-2-4-2- Decision making based on neural networks

    2-2-5- Correction of VAD results

    2-3- Block diagram of several standard VADs

    2-3-1- The ETSI AMR standard

    2-4- Summary

    Chapter Three: Detecting speaker change

    3-1- Introduction

    3-2- Speaker segmentation

    3-2-1- Distance-based segmentation

    3-2-2- Model-based segmentation

    3-2-3- Hybrid segmentation

    3-3- Comparison of segmentation methods

    3-4- Common speaker change detection methods

    3-4-2- Combination of the T2 statistic and BIC

    3-4-2-1- Greater speed and efficiency in T2-BIC segmentation

    3-4-3- Generalized likelihood ratio (GLR)

    3-4-4- The KL2 distance

    3-4-5- Speaker change detection using DSD

    3-4-6- Cross-BIC (XBIC)

    3-4-7- Gaussian mixture model estimation (GMM-L)

    3-5- Summary

    Chapter Four: Classification methods

    4-1- Introduction

    4-2- Components of a clustering system

    4-3- Clustering methods

    4-3-1- Hierarchical clustering methods

    4-3-1-1- Ascending (bottom-up) clustering techniques

    4-3-1-2- Descending (top-down) clustering techniques

    4-3-2- Ascending clustering methods

    4-4- Common clustering methods in speaker clustering systems

    4-5- The support vector machine classifier

    4-5-1- The linear support vector machine classifier

    4-5-1-1- Classification of separable classes

    4-5-2- Non-linear support vector machines

    4-6- Summary

    Chapter Five: Implementation and observations of the proposed hybrid system

    5-1- Introduction

    5-2- Structure of the implemented system

    5-3- Database

    5-4- Feature extraction

    5-5- Evaluation criteria for speaker recognition systems

    5-6- Test results

    5-6-1- The effect of applying VAD to the speech signal

    5-6-2- The effect of the VAD window length on system accuracy

    5-6-3- The effect of the BIC window length on the segmentation results

    5-6-4- The effect of the feature vector on the accuracy of the segmentation stage

    5-6-6- Comparison of segmentation-stage results using different feature vectors

    5-6-7- The effect of speaker gender on the correct recognition of segmentation boundaries

    5-6-8- Accuracy of the clustering stage using the support vector machine (SVM) with the MFCC feature vector

    5-6-9- Accuracy of the support vector machine clustering stage using the root-MFCC feature vector

    5-6-10- The effect of the support vector machine kernel function on the accuracy of the clustering stage

    5-7- Summary

    Chapter Six: Conclusions and suggestions

    6-1- Summary of results

    6-2- Suggestions

    References

    References:

    [1] X. Anguera Miro, "Robust Speaker Diarization for Meetings", PhD thesis, 2006.

    [2] L. Docio, C. Garcia, "Speaker Segmentation, Detection and Tracking in Multi-speaker Long Audio Recordings", Third COST275 Workshop on Biometrics on the Internet, 2005.

    [3] J. Zibert, B. Vesnicer, F. Mihelic, "A System for Speaker Detection and Tracking in Audio Broadcast News", IEEE Proceedings, pp. 51-61, 2008.

    [4] A. F. Martin, M. A. Przybocki, "Speaker Recognition in a Multi-speaker Environment", Proc. Eurospeech 2001 Scandinavia, Conference on Speech Communication and Technology, 2001.

    [5] R. O. Duda, P. E. Hart, D. G. Stork, "Pattern Classification", John Wiley and Sons, 2nd edition, 2007.

    [6] C. M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.

    [7] M. A. Siegler, U. Jain, B. Raj, M. Stern, "Automatic Segmentation, Classification and Clustering of Broadcast News Audio", Proc. DARPA Speech Recognition Workshop, Chantilly, Virginia, pp. 97-99, 1997.

    [8] S. Chen, P. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, pp. 127-132, 1998.

    [9] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, S. J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, pp. 133-137, 1998.

    [10] J. Ajmera, C. Wooters, "A Robust Speaker Clustering Algorithm", Proc. ASRU (Automatic Speech Recognition and Understanding) Workshop, U.S. Virgin Islands, pp. 411-416, 2003.

    [11] B. Zhou, J. H. L. Hansen, "Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion", Proc. ICSLP, Beijing, China, pp. 714-717, 2000.

    [12] K. Sonmez, L. Heck, M. Weintraub, "Speaker Tracking and Detection with Multiple Speakers", Proc. Eurospeech, Budapest, Vol. 5, pp. 2219-2222, 1999.

    [13] P. C. Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, S. J. Young, "Experiments in Broadcast News Transcription", Proc. ICASSP, Seattle, Washington, pp. 909 ff., 1998.

    [14] L. Wilcox, F. Chen, D. Kimber, V. Balasubramanian, "Segmentation of Speech Using Speaker Identification", Proc. ICASSP, Adelaide, Australia, pp. 161-164, 1994.

    [15] H. Kim, D. Ertelt, T. Sikora, "Hybrid Speaker-based Segmentation System Using Model-level Clustering", Proc. ICASSP, Philadelphia, USA, pp. 745-748, 2005.

    [16] H. Kim, T. Sikora, "Automatic Segmentation of Speakers in Broadcast Audio Material", Proc. SPIE, Vol. 5307, pp. 429-438, 2003.

    [17] P. Yu, F. Seide, C. Ma, E. Chang, "An Improved Model-based Speaker Segmentation System", Proc. Eurospeech, Geneva, Switzerland, pp. 2025-2028, 2003.

    [18] D. Valj, B. Kacic, B. Horvat, "Usage of Frame Dropping and Frame Attenuation Algorithms in Automatic Speech Recognition System", IEEE Proceedings, pp. 149-152, 2003.

    [19] J. Faneuff, "Spatial, Spectral, and Perceptual Nonlinear Noise Reduction for Hands-free Microphones in a Car", Master's thesis, Electrical and Computer Engineering, 2002.

    [20] L. Karray, C. Mokbel, J.
