ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

Abstract

Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel’s statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
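
The channel-dependent frame attention mentioned in the abstract can be illustrated with a minimal sketch. It assumes the attention scores are already available as one score per frame and channel (in the actual model they would come from a learned layer, which is not reproduced here). The function name `attentive_stats_pooling` and its argument layout are hypothetical and not taken from the authors' code. For each channel, the scores are normalized over time with a softmax, and the resulting weights are used to compute a weighted mean and a weighted standard deviation, so an utterance of any length maps to a vector of fixed size.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, scores):
    """Pool T frames of C channels into a fixed-length vector of size 2*C.

    frames: list of T lists, each holding C channel activations.
    scores: list of T lists, each holding C unnormalized attention scores
            (assumed given here; hypothetical stand-in for a learned layer).
    Returns the C weighted means followed by the C weighted standard deviations.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        # Each channel gets its own attention weights over the time axis.
        weights = softmax([scores[t][c] for t in range(num_frames)])
        values = [frames[t][c] for t in range(num_frames)]
        mean = sum(w * v for w, v in zip(weights, values))
        second_moment = sum(w * v * v for w, v in zip(weights, values))
        # Clamp at zero to guard against small negative rounding errors.
        variance = max(second_moment - mean * mean, 0.0)
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means + stds


# Three frames, two channels. Channel 0 uses uniform attention,
# channel 1 puts nearly all of its weight on the last frame.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]
scores = [[0.0, 0.0], [0.0, 0.0], [0.0, 50.0]]
embedding = attentive_stats_pooling(frames, scores)
```

With all scores equal, the weights are uniform and the result reduces to ordinary statistics pooling (plain mean and standard deviation). Letting the scores differ per channel is what allows each channel's statistics to focus on a different subset of frames.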

Keywords

Computer science; Pooling; Artificial neural network; Speech recognition; Time delay neural network; Channel (broadcasting); Pattern recognition (psychology); Frame (networking); Context (archaeology); Artificial intelligence; Speaker recognition; Telecommunications

Affiliated Institutions

Ghent University (BE)

Related Publications

Phoneme recognition using time-delay neural networks

Alexander Waibel , Toshiyuki Hanazawa , Geoffrey E. Hinton +2 more

The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of...

1989 IEEE Transactions on Acoustics Speech... 2619 citations

Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

Ossama Abdel‐Hamid , Abdelrahman Mohamed , Hui Jiang +1 more

Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of loca...

2012 885 citations

Deep Belief Networks using discriminative features for phone recognition

Abdelrahman Mohamed , Tara N. Sainath , George E. Dahl +3 more

Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of fe...

2011 289 citations

Res2Net: A New Multi-Scale Backbone Architecture

Shanghua Gao , Ming‐Ming Cheng , Kai Zhao +3 more

Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstra...

2019 IEEE Transactions on Pattern Analysis... 3082 citations

A review of large-vocabulary continuous-speech

Steve Young

Considerable progress has been made in speech-recognition technology over the last few years and nowhere has this progress been more evident than in the area of large-vocabulary...

1996 IEEE Signal Processing Magazine 216 citations

Publication Info

Year: 2020
Type: article
Citations: 1214
Access: Closed

Cite This

APA Style

                            
Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. https://doi.org/10.21437/interspeech.2020-2650

Identifiers

DOI: 10.21437/interspeech.2020-2650