Abstract

In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.
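For a concrete picture of the embedding extractor described above, the sketch below outlines an x-vector-style network in PyTorch: frame-level TDNN layers (emulated here with dilated 1-D convolutions), a statistics-pooling layer that maps a variable-length utterance to a fixed-size vector, and segment-level layers trained with a speaker-classification softmax. This is a minimal illustrative sketch, not the authors' Kaldi recipe; the layer widths, contexts, and names (e.g. XVectorSketch, segment6) are assumptions, and the noise/reverberation augmentation would be applied to the training audio before feature extraction.

```python
# Minimal sketch of an x-vector-style embedding network (illustrative, not the
# exact architecture from the paper). Frame-level layers -> statistics pooling
# -> segment-level layers -> speaker softmax; the embedding is taken from the
# first segment-level affine layer.
import torch
import torch.nn as nn


class XVectorSketch(nn.Module):
    def __init__(self, feat_dim=24, embed_dim=512, num_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers, emulated with dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers; the x-vector is the output of segment6.
        self.segment6 = nn.Linear(2 * 1500, embed_dim)
        self.segment7 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_speakers)

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames); num_frames may vary per batch.
        h = self.frame_layers(feats)
        # Statistics pooling: concatenate mean and standard deviation over the
        # time axis, giving a fixed-size summary of a variable-length input.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvector = self.segment6(stats)  # fixed-dimensional speaker embedding
        logits = self.output(torch.relu(self.segment7(torch.relu(xvector))))
        return xvector, logits


if __name__ == "__main__":
    feats = torch.randn(8, 24, 300)        # 8 utterances, 24-dim features, 300 frames
    emb, logits = XVectorSketch()(feats)
    print(emb.shape, logits.shape)         # (8, 512) and (8, 1000)
```

At test time the softmax layer is discarded; the fixed-dimensional embeddings are compared with a PLDA back-end, which is where the paper also applies augmentation for the i-vector baselines.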

Keywords

Computer science, NIST, Robustness, Speech recognition, Leverage, Classifier, Artificial neural network, Artificial intelligence, Training set, Pattern recognition, Speaker recognition, Extractor, Deep neural networks

Publication Info

Year: 2018
Type: Article
Pages: 5329-5333
Citations: 2529
Access: Closed

Citation Metrics

OpenAlex: 2529
Influential: 329
CrossRef: 1577

Cite This

David Snyder, Daniel Garcia-Romero, Gregory Sell et al. (2018). X-Vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329-5333. https://doi.org/10.1109/icassp.2018.8461375

Identifiers

DOI
10.1109/icassp.2018.8461375

Data Quality

Data completeness: 81%