Speaker recognition in multi-speaker environment using support vector machine

Number of pages: 118 File Format: word File Code: 32227
Year: 2011 University Degree: Master's degree Category: Electronic Engineering
  • Summary of Speaker recognition in multi-speaker environment using support vector machine

    Electronics group

    Master thesis

    Abstract:

    Speaker identification is one of the topics studied in speech processing. Here it means determining from the speech signal who is speaking and when. The goal is a system that detects each change of speaker and labels every speaker's speech, that is, one that specifies which speaker spoke in which intervals. Today this task, which combines segmentation and labeling, is commonly known as speaker diarization. The purpose of segmentation is to divide the speech signal into parts that each contain the speech of only one speaker; the purpose of clustering is to find all the parts spoken by the same speaker and assign them a single label. This thesis designs and implements a speaker segmentation and clustering system using recent algorithms and seeks to improve the results these algorithms give on this problem. The system must detect speaker change points correctly without any prior information about the speakers, and finally place all audio parts belonging to one speaker in one cluster. In the first step, the non-speech parts of the audio file are removed to increase the accuracy and speed of the later steps. In the second step, the speech is divided into homogeneous parts, each containing a single speaker's speech. In the third step, a suitable clustering method places the parts from the previous step that belong to the same speaker in one cluster. The system was implemented with four types of feature vectors (MFCC, root-MFCC, TDC, and root-TDC) and three databases. The accuracy of the segmentation stage was 80%, and the accuracy of the clustering stage using the support vector machine was 59%.

    Keywords: speaker segmentation, audio segment recognition, speaker clustering
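
    The first step described in the abstract, removing non-speech before segmentation, can be illustrated with a very simple energy-based voice activity detector. The sketch below is not the detector used in the thesis, which this summary does not specify. The frame length, the relative threshold, and the function names `frame_energies` and `energy_vad` are assumptions chosen for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_vad(samples, frame_len=160, ratio=0.1):
    """Return one boolean per frame: True where the frame looks like speech.

    A frame is kept when its energy exceeds `ratio` times the energy of the
    loudest frame. Both parameters are illustrative assumptions.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    threshold = ratio * max(energies)
    return [e > threshold for e in energies]


# Toy signal at 8 kHz: silence, a 200 Hz tone standing in for speech, silence.
silence = [0.0] * 160
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
flags = energy_vad(silence + tone + silence)
```

    On this toy input only the middle frame is flagged as speech. A practical detector would combine several features and smooth its decisions over time, but the principle of discarding low-energy frames before segmentation is the same.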

    Introduction

    Today, multimedia data make up a significant part of human knowledge, and the number of multimedia files archived by various institutions has grown considerably in recent years. Making these files accessible and well organized is a great help to people looking for information. Searching and retrieving information in such a large volume requires a computer system, so the structuring of multimedia files has recently attracted research attention. Among these data, audio is especially important, because most archives contain audio from TV and radio reports as well as phone conversations. Extensive research has been carried out in this field in recent years, with acceptable results. Other uses of this field include identifying offenders and isolating the important statements of a witness or defendant in court.

    In audio applications, the main information in a file is the speech of a number of speakers, and the purpose of the final system is to answer the question: who spoke, and at what times? Different parts of this research field go by different names, such as speaker segmentation [1], speaker detection [2], robust transcription [3], and speaker indexing [4]. Such systems make it easy to navigate long audio files that involve several speakers, such as news broadcasts, meetings, and company sessions. Long radio conversations and meetings are settings in which several speakers are present and talk to each other. The ultimate goal of such systems is to divide an audio file into the regions in which a particular speaker is talking, which gives easy access to the parts spoken by any one speaker.
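
    The answer to "who spoke, and at what times?" can be pictured as a small data structure. The sketch below is only an illustration: the `Segment` type, the labels `spk1` and `spk2`, the time values, and the helper names are hypothetical and do not come from the thesis.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    speaker: str   # anonymous label; the system does not know real identities


def build_index(segments):
    """Group segments by speaker label so one speaker's turns can be fetched directly."""
    index = defaultdict(list)
    for seg in sorted(segments):
        index[seg.speaker].append((seg.start, seg.end))
    return dict(index)


def who_spoke_at(segments, t):
    """Return the label of the speaker active at time t, or None if nobody is."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.speaker
    return None


# Hypothetical output of a segmentation and clustering system.
segments = [
    Segment(0.0, 4.2, "spk1"),
    Segment(4.2, 9.0, "spk2"),
    Segment(9.0, 12.5, "spk1"),
]
index = build_index(segments)
```

    With such an index, `index["spk1"]` lists every interval in which the first speaker talks, which is the kind of easy access to one speaker's speech that the paragraph above describes.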

    As the number of text documents available on the Internet grew, so did the need for techniques such as text indexing that make these documents easier to access and search. A similar need arose as the number of audio documents, such as lectures, interviews, and meetings, increased. Accessing audio documents is much harder than accessing text: listening to a recording takes longer than reading, and indexing audio by hand is more difficult than indexing text. The proposed solution is the automatic indexing of audio documents[5]. In 2001, Pelecanos and Sridharan and their group improved the system's results by reducing the effect of noise on the signal, which led to better speaker separation. In 2005, Boulianne and Kenny obtained different results by using other feature vectors (or by combining previous methods) together with Gaussian models. Also in 2005, Yamashita and Matsunaga improved speaker segmentation results by using audio signal features such as pitch frequency, energy, peak frequencies, and three other features.[1] In the following years, different methods were applied to the individual stages, and these systems have been progressively refined and their results improved.

    The purpose of this thesis is to design and implement a system that detects speaker changes in an audio file containing the speech of several speakers and, as far as possible, groups the speech of each speaker without any prior information about that speaker. The system has two basic parts:

    - Speaker segmentation
    - Speaker clustering

    The segmentation part[6] divides the speech signal into segments that each contain the speech of only one speaker. In the clustering stage [7], the segments belonging to the same speaker are identified, grouped, and given a single label. This technique is used in many speech applications related to speech recognition or indexing[8] in environments where several speakers may talk, such as meetings, conferences, and news broadcasts. It can help advanced speech recognition systems not only improve their recognition results but also identify and transcribe conversations, as information varies depending on who utters the spoken words.

    Within the speech technologies, the broad topic of acoustic indexing studies the classification of sounds into different classes or sources. Algorithms used for acoustic indexing are concerned with classifying the sounds correctly, but not necessarily with separating them correctly when more than one is present in the same audio segment. These purely classification techniques have sometimes been called audio clustering, which benefits from the broad topic of clustering, well studied in many areas. When multiple sounds appear in the same audio signal, one must turn to techniques known as audio diarization to process them. The sounds can include particular speakers, music, and background noise sources.

    When the possible classes correspond to the different speakers in a recording, these techniques are called speaker diarization. Speaker diarization can be defined as a subtype of audio diarization in which the speech segments of the signal are divided among the different speakers. It aims at answering the question "Who spoke when?" given an audio signal. Algorithms that perform speaker diarization need to locate each speaker turn and assign it to the appropriate speaker cluster. The output of the system is a set of segments with a unique ID assigned to each person who intervenes in the recording.
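
    Locating a speaker turn can be illustrated with the Bayesian information criterion (BIC), which the contents list of the thesis names among its change detection methods. The sketch below uses the commonly cited delta-BIC form and makes strong simplifying assumptions: each frame is a single scalar feature modeled by one Gaussian, whereas real systems use multidimensional vectors such as MFCCs with full covariance matrices. The penalty weight, the margin, the toy data, and the function names are illustrative and are not the settings of the thesis.

```python
import math


def variance(values):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return max(var, 1e-10)


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for splitting a window of scalar features at index `split`.

    Positive values favour two Gaussians (a speaker change) over one.
    """
    left, right = window[:split], window[split:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(variance(window))
                  - n1 * math.log(variance(left))
                  - n2 * math.log(variance(right)))
    # Extra parameters of the second Gaussian for dimension 1: one mean and
    # one variance, so the penalty is 0.5 * 2 * log(n) times the weight.
    penalty = penalty_weight * math.log(n)
    return gain - penalty


def best_change_point(window, margin=5):
    """Return (index, score) of the best split, or None if no split scores above zero."""
    candidates = range(margin, len(window) - margin + 1)
    scored = [(delta_bic(window, i), i) for i in candidates]
    if not scored:
        return None
    score, index = max(scored)
    return (index, score) if score > 0 else None


# Toy features: 30 frames near 0 for one speaker, then 30 frames near 5 for another.
speaker_a = [0.1 * ((i % 5) - 2) for i in range(30)]
speaker_b = [5.0 + 0.1 * ((i % 5) - 2) for i in range(30)]
result = best_change_point(speaker_a + speaker_b)
```

    On this toy input the best split falls at index 30, the true boundary, while a window containing only `speaker_a` yields no change point. Sliding such a window along a recording and keeping the positive peaks gives candidate speaker turns for the clustering stage.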

  • Contents & References of Speaker recognition in multi-speaker environment using support vector machine

    List:

    Chapter One: Introduction to speaker recognition systems

    1-1- Introduction

    1-2- Different working stages of speaker recognition systems

    1-2-3- Speaker gender detection

    1-2-4- Speaker change detection

    1-3- Speaker segmentation and clustering methods

    1-3-1- Methods based on distance

    1-3-2- Methods based on model

    1-3-3- Hybrid or combined methods

    1-4- Clustering

    1-5- Summary

    Chapter Two: Distinguishing speech from non-speech regions

    2-1- Introduction

    2-2- Structure of speech/non-speech detection

    2-2-1- Pre-processing

    2-2-2- Feature extraction

    2-2-2-1- Energy

    2-2-2-2- Zero-crossing rate

    2-2-2-3- Feature extraction using Mel-frequency cepstral coefficients

    2-2-2-4- LPC coefficients

    2-2-2-5- Entropy

    2-2-2-8- Other parameters

    2-2-3- Threshold calculation

    2-2-4- VAD decisions

    2-2-4-1- Decision-making based on the hidden Markov model

    2-2-5- Correction of VAD results

    2-3- Block diagrams of several VAD standards

    2-3-1- ETSI AMR standard

    2-3-2- GSM algorithm

    Chapter Three: Speaker change detection

    3-1- Introduction

    3-2- Speaker segmentation

    3-3- Comparison of segmentation methods

    3-4- Common methods of speaker detection

    3-4-1- Bayesian information criterion (BIC)

    3-4-1-2- Segmentation using the statistical model of the speaker

    BIC

    3-4-2-1- Greater speed and gain in segmentation with T2-BIC

    3-4-3- Generalized likelihood ratio (GLR) distance

    3-4-4- KL2 distance

    3-4-5- Speaker change detection using DSD

    3-4-6- Cross-BIC (XBIC)

    4-1- Introduction

    4-2- Components of the clustering system

    4-3- Clustering methods

    4-3-1- Hierarchical clustering methods

    4-3-1-1- Bottom-up clustering techniques

    4-3-1-2- Top-down clustering techniques

    4-3-2- Upward clustering methods

    4-4- Common clustering methods in speaker clustering systems

    4-5- Support vector machine classifier

    4-5-1- Linear support vector machine classifier

    4-5-1-1- Classification of separable classes

    4-5-1-2- Classification of non-separable classes

    4-5-1-3- Classification of multi-class data with support vector machines

    4-5-2- Nonlinear support vector machines

    4-6- Summary

    Chapter Five: Implementation and observations of the proposed hybrid system

    5-1- Introduction

    5-2- Structure of the implemented system

    5-3- Database

    5-4- Feature extraction

    5-5- Evaluation criteria for speaker recognition systems

    5-6- Test results

    5-6-1- The effect of applying VAD to the speech signal

    5-6-2- The effect of the VAD window length on the accuracy of the system

    5-6-3- The effect of the BIC window length on the segmentation results

    5-6-4- Accuracy resulting from segmentation on two types of data

    5-6-6- Comparison of the results of the segmentation stage using different feature vectors

    5-6-7- The effect of speaker gender on the correct identification of segment boundaries

    5-6-8- Accuracy of the clustering stage using the support vector machine (SVM) with the MFCC feature vector

    5-6-9- Accuracy of the support vector machine clustering stage using the root-MFCC feature vector

    5-6-10- The effect of the type of support vector machine kernel function on the accuracy of the clustering stage

    5-7- Summary

    Chapter Six: Summary and suggestions

    6-1- Summary of results

    6-2- Recommendations

    References

    Source:

    [1]. X. Anguera Miró, "Robust Speaker Diarization for Meetings", PhD thesis, 2006.

    [2]. L. Docio, C. Garcia, "Speaker Segmentation, Detection and Tracking in Multi-Speaker Long Audio Recordings", Third COST 275 Workshop: Biometrics on the Internet, 2005.

    [3]. J. Zibert, B. Vesnicer, F. Mihelic, "A System for Speaker Detection and Tracking in Audio Broadcast News", IEEE proceeding, pp. 51-61, 2008.

    [4]. A.F. Martin, M.A. Przybocki, "Speaker Recognition in a Multi-Speaker Environment", Eurospeech 2001 Scandinavia, Conference on Speech Communication and Technology, 2001.

    [5]. R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", John Wiley and Sons, 2nd edition, 2007.

    [6]. C.M. Bishop, "Pattern Recognition and Machine Learning", pp. 738, Springer, 2006.

    [7]. M.A. Siegler, U. Jain, B. Raj, M. Stern, "Automatic Segmentation, Classification and Clustering of Broadcast News Audio", Proc. DARPA Speech Recognition Workshop, Chantilly, Virginia, pp. 97-99, 1997.

    [8]. S. Chen, P. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, pp. 127-132, 1998.

    [9]. T. Hain, S.E. Johnson, A. Tuerk, P.C. Woodland, S.J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, pp. 133-137, 1998.

    [10]. J. Ajmera, C. Wooters, "A Robust Speaker Clustering Algorithm", Proc. ASRU (Automatic Speech Recognition and Understanding) Workshop, U.S. Virgin Islands, pp. 411-416, 2003.

    [11]. B. Zhou, J.H.L. Hansen, "Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion", Proc. ICSLP, Beijing, China, pp. 714-717, 2000.

    [12]. K. Sonmez, L. Heck, M. Weintraub, "Speaker Tracking and Detection with Multiple Speakers", Proc. EUROSPEECH, Budapest, Vol. 5, pp. 2219-2222, 1999.

    [13]. P.C. Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, S.J. Young, "Experiments in Broadcast News Transcription", Proc. ICASSP, Seattle, Washington, pp. 909 ff, 1998.

    [14]. L. Wilcox, F. Chen, D. Kimber, V. Balasubramanian, "Segmentation of Speech Using Speaker Identification", Proc. ICASSP, Adelaide, Australia, pp. 161-164, 1994.

    [15]. H. Kim, D. Ertelt, T. Sikora, "Hybrid Speaker-Based Segmentation System Using Model-Level Clustering", Proc. ICASSP, Philadelphia, USA, pp. 745-748, 2005.

    [16]. H. Kim, T. Sikora, "Automatic Segmentation of Speakers in Broadcast Audio Material", Proc. SPIE, Vol. 5307, pp. 429-438, 2003.

    [17]. P. Yu, F. Seide, C. Ma, E. Chang, "An Improved Model-Based Speaker Segmentation System", Proc. EUROSPEECH, Geneva, Switzerland, pp. 2025-2028, 2003.

    [18]. D. Valj, B. Kacic, B. Horvat, "Usage of Frame Dropping and Frame Attenuation Algorithms in Automatic Speech Recognition System", IEEE proceeding, pp. 149-152, 2003.

    [19]. J. Faneuff, "Spatial, Spectral, and Perceptual Nonlinear Noise Reduction for Hands-Free Microphones in a Car", Master thesis, Electrical and Computer Engineering, 2002.

    [20]. L. Karray, C. Mokbel, J.
