Speaker recognition in multi-speaker environment using support vector machine

Number of pages: 117 | File format: Word | File code: 31352
Year: 2011 | Degree: Master's | Category: Electronic Engineering
  • Summary of Speaker recognition in multi-speaker environment using support vector machine

    Electronics group

    Master's thesis

    Abstract:

    Speaker identification is one of the central topics in speech processing: determining from the speech signal who is speaking, and when. The goal is to design a system that detects speaker changes and labels each speaker's speech, that is, specifies which speaker spoke in which intervals. Today this task, which combines the processes of segmentation and labeling, is commonly known as speaker diarization. The purpose of segmentation is to divide the speech signal into parts that contain the speech of only one speaker, and the purpose of clustering is to identify all the speech parts belonging to one speaker and assign them a single label. The purpose of this thesis is to design and implement a speaker segmentation and clustering system using recent algorithms, and to improve the results of these algorithms for this problem. The system must correctly recognize speaker change points without any prior information about the speakers, and finally place all the audio parts belonging to one speaker in a single cluster.

    In this dissertation, the speaker recognition system consists of three main stages. In the first stage, the non-speech parts of the audio file are removed, in order to increase the accuracy and speed of the following stages. The speech file is then divided into homogeneous segments, each containing the speech of only one speaker. In the third stage, a suitable clustering places the segments from the previous stage that belong to the same speaker into one cluster. To implement the system, four types of feature vectors (MFCC, root-MFCC, TDC, and root-TDC) and three databases were used; the accuracy of the segmentation stage reached 80%, and the accuracy of the clustering stage reached 59% using the support vector machine.
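    The thesis reports the clustering-stage accuracy with a support vector machine but gives no implementation details in this summary. The following is a minimal sketch of the idea, assuming scikit-learn's SVC and synthetic stand-ins for MFCC frame vectors; the data, dimensions, and RBF kernel are illustrative assumptions, not the thesis's actual setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dimensional MFCC frame vectors from two
# speakers (the thesis's real features are MFCC/root-MFCC/TDC/root-TDC).
speaker_a = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
speaker_b = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
frames = np.vstack([speaker_a, speaker_b])
labels = np.array([0] * 200 + [1] * 200)

# Train an SVM on half of the frames, then label the rest; a segment's
# speaker can then be taken as the majority vote over its frame labels.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(frames[::2], labels[::2])
predicted = clf.predict(frames[1::2])
accuracy = (predicted == labels[1::2]).mean()
print(f"frame-level accuracy: {accuracy:.2f}")
```

    On such well-separated synthetic data the SVM separates the two speakers almost perfectly; real MFCC vectors overlap far more, which is consistent with the 59% clustering accuracy the thesis reports.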

    Keywords:

    Statistical speaker segmentation

    Speaker segmentation

    Speech segment recognition

    Speaker clustering

    Introduction

    Today, multimedia data makes up a significant part of human knowledge, and the volume of multimedia files archived by various institutions has grown considerably in recent years. Making these files accessible and searchable can greatly help people looking for information, and searching and retrieving information at this scale requires a computer system. As a result, the structuring of multimedia files has recently become an active research area. Among such data, audio information is especially important, because most archives contain audio from TV and radio reports as well as telephone conversations. Extensive research has begun in this field in recent years and has produced acceptable results. Other applications include identifying a culprit and extracting the key statements of a witness or defendant in court.

    In audio applications, the main information in the files is the conversation of a number of speakers, and the purpose of the final system is to answer the question of who spoke at what times. Different parts of this research field go by different names, such as speaker segmentation [1], speaker recognition [2], robust transcription [3], and speaker indexing [4]. Such systems allow easy navigation of long audio files that belong to several speakers (such as news broadcasts or company meetings). Long radio conversations and meetings are environments in which several speakers are present and talk to each other. The ultimate goal of such systems is to partition audio files into the regions in which a particular speaker spoke, providing easy access to the parts of a given speaker's speech. The larger the volume of audio data, the more important these systems become.

    As the number of text documents available on the Internet grew, techniques such as text indexing became necessary to make these documents easy to access and search. A similar need arose as the number of audio documents, such as lectures, interviews, and meetings, increased. Accessing audio documents is clearly much harder than accessing text: listening to a recorded audio file is more time-consuming than reading a text, and manually indexing audio documents is difficult compared with text indexing. The proposed solution to this problem is the automatic indexing of audio documents [5]. In 2001, Pelkan and Sidharun and their group improved system results by reducing the effect of noise on the signal, leading to better speaker separation. In 2005, Boulian and Kenny obtained different results by using other feature vectors (or combining earlier methods) and by using Gaussian models in the system. Also in 2005, Yamashita and Matsunaga improved the speaker segmentation results by using audio signal features such as pitch frequency, energy, the signal's maximum frequencies, and three other features [1]. In the following years, different methods applied to the system's different parts have continued to refine these systems and improve their results.

    The aim of this thesis is to design and implement a system that can detect speaker changes in an audio file containing the speech of several speakers and, as far as possible, group each speaker's speech without prior information about the speakers. The system comprises two basic parts: speaker segmentation and speaker clustering. The task of segmentation [6] is to divide the speech signal into segments that contain the speech of only one speaker. In the clustering stage [7], the segments belonging to the same speaker are identified and assigned a single label. This work is useful in many speech applications involving recognition or indexing [8] in environments where several speakers may talk, such as meetings, conferences, and news broadcasts. It can not only help advanced speech recognition systems improve group recognition results, but also assist them in identifying and transcribing conversations. As mentioned before, it can also be used in audio indexing, which makes it possible to search audio files. Figure (1-1) shows how this system works.

    Figure (1-1): Display of speaker segmentation on input speech
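    The segmentation stage sketched above must locate speaker change points without prior speaker models. A standard way to do this, in the family of BIC-based methods the thesis covers in Chapter 3, is to compare two adjacent analysis windows with the Bayesian information criterion. The sketch below is a generic delta-BIC test on synthetic feature vectors; the penalty weight lambda, the window sizes, and the feature dimension are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC for the hypothesis that windows x and y come from two
    different full-covariance Gaussians rather than one; a positive
    value suggests a speaker change between the windows."""
    z = np.vstack([x, y])
    n, d = z.shape

    def half_n_logdet(w):
        # 0.5 * len(w) * log |cov(w)|, the Gaussian log-likelihood term
        cov = np.cov(w, rowvar=False)
        return 0.5 * len(w) * np.linalg.slogdet(cov)[1]

    # Model-complexity penalty: d mean parameters + d(d+1)/2 covariance
    # parameters for the extra Gaussian, weighted by lambda.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return half_n_logdet(z) - half_n_logdet(x) - half_n_logdet(y) - penalty

rng = np.random.default_rng(1)
same = rng.normal(0.0, 1.0, size=(400, 2))
other = rng.normal(4.0, 1.0, size=(200, 2))

# Homogeneous windows: delta-BIC stays negative (no change point).
print(delta_bic(same[:200], same[200:]))
# Windows drawn from different "speakers": delta-BIC turns positive.
print(delta_bic(same[:200], other))
```

    In a full system this test slides along the file, and each local maximum of delta-BIC above zero is declared a speaker change point.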

    The audio file under review is a single-channel recording that contains several audio sources. These sources can include several speakers, music, various kinds of noise, and so on; their type and details depend on the functional characteristics of the file. Speaker segmentation and clustering systems are generally used in three areas: broadcast news, recorded meetings, and telephone conversations. As mentioned earlier, these areas differ in recording quality (bandwidth, microphones, and noise), the amount and type of non-speech sources, the number of speakers, and the style and structure of the speech (speech duration, speaker order), and each area poses its own problems for speaker segmentation and clustering. Speaker recognition systems nevertheless try to achieve acceptable results in all three areas [1].

    At the lowest level, such a system classifies the audio data into clusters, each of which belongs to one speaker. Two views are possible here: supervised [9] and unsupervised [10]. In the supervised view, information about who speaks in the audio file is already available. In the unsupervised view, the system must partition the file into time intervals in each of which only one speaker, whose identity is unknown, speaks. Note that the output of an unsupervised classification system can be used as the input to identification systems [11], yielding a supervised classification system whose efficiency and execution time are better. On the other hand, the performance of these systems also depends on how much prior information is allowed: speech samples from the speakers, the number of speakers in the audio file, or information about the structure of the recording. In most speaker segmentation and clustering systems, however, it is assumed that there is no prior information about the speakers or their number.
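    The unsupervised view described above is usually realized with bottom-up (agglomerative) clustering, which the thesis discusses in Chapter 4. The sketch below uses scikit-learn on synthetic per-segment features, and it assumes the number of speakers is known; real diarization systems must estimate that number (for example, with a BIC stopping criterion), so this is an illustrative simplification.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)

# Synthetic per-segment feature vectors (e.g. averaged MFCCs) for ten
# segments from two speakers; the labels are hidden from the system.
segments = np.vstack([
    rng.normal(0.0, 0.5, size=(5, 13)),   # segments of speaker A
    rng.normal(4.0, 0.5, size=(5, 13)),   # segments of speaker B
])

# Bottom-up clustering repeatedly merges the closest segments until the
# requested number of clusters (here, speakers) remains.
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(segments)
print(clusters)
```

    Each resulting cluster label plays the role of a speaker tag, so all segments with the same label are attributed to one (anonymous) speaker.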

  • Contents & References of Speaker recognition in multi-speaker environment using support vector machine

    Table of Contents

    Chapter One: Introduction to speaker recognition systems

    1-1- Introduction

    1-2- The different working stages of speaker recognition systems

    1-2-1- Acoustic segmentation

    1-2-2- Recognizing speech from non-speech

    1-2-3- Speaker gender detection

    1-2-4- Speaker change detection

    1-3- Speaker segmentation and clustering methods

    1-3-2- Model-based methods

    1-3-3- Hybrid or combined methods

    1-4- Clustering

    1-5- Summary

    Chapter Two: Recognizing speech from non-speech regions

    2-1- Introduction

    2-2- Structure of speech/non-speech recognition

    2-2-1- Preprocessing

    2-2-2- Feature extraction

    2-2-2-1- Energy

    2-2-2-2- Zero crossing rate

    2-2-2-3- Feature extraction using mel-scale cepstral frequency coefficients

    2-2-2-4- LPC coefficients

    2-2-2-5- Entropy

    2-2-2-6- Intermittent size

    2-2-2-7- … Band

    2-2-2-8- Other parameters

    2-2-3- Threshold calculation

    2-2-4- VAD decision making

    2-2-4-1- Decision making based on hidden Markov models

    2-2-4-2- Decision making based on neural networks

    2-2-5- Correction of VAD results

    2-3- Block diagram of several standard VADs

    2-3-1- The ETSI AMR standard

    2-4- Summary

    Chapter Three: Detecting speaker change

    3-1- Introduction

    3-2- Speaker segmentation

    3-2-1- Distance-based segmentation

    3-2-2- Model-based segmentation

    3-2-3- Hybrid segmentation

    3-3- Comparison of segmentation methods

    3-4- Common speaker change detection methods

    3-4-2- Combination of the T2 statistic and BIC

    3-4-2-1- Greater speed and efficiency in T2-BIC segmentation

    3-4-3- Generalized likelihood ratio (GLR)

    3-4-4- The KL2 distance

    3-4-5- Speaker change detection using DSD

    3-4-6- Cross-BIC (XBIC)

    3-4-7- Gaussian mixture model estimation (GMM-L)

    3-5- Summary

    Chapter Four: Classification methods

    4-1- Introduction

    4-2- Components of a clustering system

    4-3- Clustering methods

    4-3-1- Hierarchical clustering methods

    4-3-1-1- Ascending (bottom-up) clustering techniques

    4-3-1-2- Descending (top-down) clustering techniques

    4-3-2- Ascending clustering methods

    4-4- Common clustering methods in speaker clustering systems

    4-5- The support vector machine classifier

    4-5-1- The linear support vector machine classifier

    4-5-1-1- Classification of separable classes

    4-5-2- Non-linear support vector machines

    4-6- Summary

    Chapter Five: Implementation and observations of the proposed hybrid system

    5-1- Introduction

    5-2- Structure of the implemented system

    5-3- Database

    5-4- Feature extraction

    5-5- Evaluation criteria for speaker recognition systems

    5-6- Test results

    5-6-1- The effect of applying VAD to the speech signal

    5-6-2- The effect of the VAD window length on system accuracy

    5-6-3- The effect of the BIC window length on the segmentation results

    5-6-4- The effect of the feature vector on the accuracy of the segmentation stage

    5-6-6- Comparison of segmentation-stage results using different feature vectors

    5-6-7- The effect of speaker gender on the correct recognition of segmentation boundaries

    5-6-8- Accuracy of the clustering stage using the support vector machine (SVM) with the MFCC feature vector

    5-6-9- Accuracy of the support vector machine clustering stage using the root-MFCC feature vector

    5-6-10- The effect of the support vector machine kernel function on the accuracy of the clustering stage

    5-7- Summary

    Chapter Six: Conclusions and suggestions

    6-1- Summary of results

    6-2- Suggestions

    References

    References:

    [1] X. Anguera Miro, "Robust Speaker Diarization for Meetings", PhD thesis, 2006.

    [2] L. Docio, C. Garcia, "Speaker Segmentation, Detection and Tracking in Multi-speaker Long Audio Recordings", Third COST275 Workshop on Biometrics on the Internet, 2005.

    [3] J. Zibert, B. Vesnicer, F. Mihelic, "A System for Speaker Detection and Tracking in Audio Broadcast News", IEEE Proceedings, pp. 51-61, 2008.

    [4] A. F. Martin, M. A. Przybocki, "Speaker Recognition in a Multi-speaker Environment", Proc. Eurospeech 2001 Scandinavia, Conference on Speech Communication and Technology, 2001.

    [5] R. O. Duda, P. E. Hart, D. G. Stork, "Pattern Classification", John Wiley and Sons, 2nd edition, 2007.

    [6] C. M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.

    [7] M. A. Siegler, U. Jain, B. Raj, M. Stern, "Automatic Segmentation, Classification and Clustering of Broadcast News Audio", Proc. DARPA Speech Recognition Workshop, Chantilly, Virginia, pp. 97-99, 1997.

    [8] S. Chen, P. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, pp. 127-132, 1998.

    [9] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, S. J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, pp. 133-137, 1998.

    [10] J. Ajmera, C. Wooters, "A Robust Speaker Clustering Algorithm", Proc. ASRU (Automatic Speech Recognition and Understanding) Workshop, U.S. Virgin Islands, pp. 411-416, 2003.

    [11] B. Zhou, J. H. L. Hansen, "Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion", Proc. ICSLP, Beijing, China, pp. 714-717, 2000.

    [12] K. Sonmez, L. Heck, M. Weintraub, "Speaker Tracking and Detection with Multiple Speakers", Proc. Eurospeech, Budapest, Vol. 5, pp. 2219-2222, 1999.

    [13] P. C. Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, S. J. Young, "Experiments in Broadcast News Transcription", Proc. ICASSP, Seattle, Washington, pp. 909 ff., 1998.

    [14] L. Wilcox, F. Chen, D. Kimber, V. Balasubramanian, "Segmentation of Speech Using Speaker Identification", Proc. ICASSP, Adelaide, Australia, pp. 161-164, 1994.

    [15] H. Kim, D. Ertelt, T. Sikora, "Hybrid Speaker-based Segmentation System Using Model-level Clustering", Proc. ICASSP, Philadelphia, USA, pp. 745-748, 2005.

    [16] H. Kim, T. Sikora, "Automatic Segmentation of Speakers in Broadcast Audio Material", Proc. SPIE, Vol. 5307, pp. 429-438, 2003.

    [17] P. Yu, F. Seide, C. Ma, E. Chang, "An Improved Model-based Speaker Segmentation System", Proc. Eurospeech, Geneva, Switzerland, pp. 2025-2028, 2003.

    [18] D. Valj, B. Kacic, B. Horvat, "Usage of Frame Dropping and Frame Attenuation Algorithms in Automatic Speech Recognition System", IEEE Proceedings, pp. 149-152, 2003.

    [19] J. Faneuff, "Spatial, Spectral, and Perceptual Nonlinear Noise Reduction for Hands-free Microphones in a Car", Master's thesis, Electrical and Computer Engineering, 2002.

    [20] L. Karray, C. Mokbel, J.
