Abstract

In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.
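For a concrete picture of the embedding extractor described above, the sketch below outlines an x-vector-style network in PyTorch: frame-level TDNN layers (emulated here with dilated 1-D convolutions), a statistics-pooling layer that maps a variable-length utterance to a fixed-size vector, and segment-level layers trained with a speaker-classification softmax. This is a minimal illustrative sketch, not the authors' Kaldi recipe; the layer widths, contexts, and names (e.g. XVectorSketch, segment6) are assumptions, and the noise/reverberation augmentation would be applied to the training audio before feature extraction.

```python
# Minimal sketch of an x-vector-style embedding network (illustrative, not the
# exact architecture from the paper). Frame-level layers -> statistics pooling
# -> segment-level layers -> speaker softmax; the embedding is taken from the
# first segment-level affine layer.
import torch
import torch.nn as nn


class XVectorSketch(nn.Module):
    def __init__(self, feat_dim=24, embed_dim=512, num_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers, emulated with dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers; the x-vector is the output of segment6.
        self.segment6 = nn.Linear(2 * 1500, embed_dim)
        self.segment7 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_speakers)

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames); num_frames may vary per batch.
        h = self.frame_layers(feats)
        # Statistics pooling: concatenate mean and standard deviation over the
        # time axis, giving a fixed-size summary of a variable-length input.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvector = self.segment6(stats)  # fixed-dimensional speaker embedding
        logits = self.output(torch.relu(self.segment7(torch.relu(xvector))))
        return xvector, logits


if __name__ == "__main__":
    feats = torch.randn(8, 24, 300)        # 8 utterances, 24-dim features, 300 frames
    emb, logits = XVectorSketch()(feats)
    print(emb.shape, logits.shape)         # (8, 512) and (8, 1000)
```

At test time the softmax layer is discarded; the fixed-dimensional embeddings are compared with a PLDA back-end, which is where the paper also applies augmentation for the i-vector baselines.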

Keywords

Computer science, NIST, Robustness, Speech recognition, Leverage, Classifier, Artificial neural network, Artificial intelligence, Training set, Pattern recognition, Speaker recognition, Extractor, Deep neural networks

Publication Info

Year: 2018
Type: Article
Pages: 5329-5333
Citations: 2529
Access: Closed

Citation Metrics

OpenAlex: 2529
Influential: 329
CrossRef: 1577

Cite This

David Snyder, Daniel Garcia-Romero, Gregory Sell et al. (2018). X-Vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329-5333. https://doi.org/10.1109/icassp.2018.8461375

Identifiers

DOI
10.1109/icassp.2018.8461375

Data Quality

Data completeness: 81%