Abstract

Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length, speaker-characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. As in SE-ResNet, we introduce Squeeze-and-Excitation (SE) blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on a different subset of frames when estimating the statistics of each channel. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
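The idea of statistics pooling with channel-dependent frame attention can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `softmax` and `channel_attentive_stats` are hypothetical, and the attention scores are assumed to be given as input, whereas in a real system a small trainable network would produce them. The sketch only shows how a separate softmax over time for each channel yields a weighted mean and standard deviation per channel, and how the output length stays fixed regardless of the number of frames.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attentive_stats(frames, scores):
    """Pool T frames of C channels into a fixed-length vector of 2*C values.

    frames[t][c] is the activation of channel c at frame t.
    scores[t][c] is an unnormalized attention score for that activation
    (assumed to be given here; hypothetical input for illustration).
    Returns the C weighted means followed by the C weighted standard deviations.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        # Normalize the scores over time separately for each channel,
        # so every channel can attend to its own subset of frames.
        weights = softmax([scores[t][c] for t in range(num_frames)])
        column = [frames[t][c] for t in range(num_frames)]
        mean = sum(w * x for w, x in zip(weights, column))
        second_moment = sum(w * x * x for w, x in zip(weights, column))
        # Clamp at zero to guard against tiny negative rounding errors.
        variance = max(second_moment - mean * mean, 0.0)
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means + stds


# Three frames, two channels.
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 4.0]]
# All-zero scores give uniform weights, i.e. plain statistics pooling.
uniform_scores = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
# Channel 0 attends to the last frame, channel 1 to the first two.
focused_scores = [[0.0, 0.0], [0.0, 0.0], [50.0, -50.0]]

plain = channel_attentive_stats(frames, uniform_scores)
focused = channel_attentive_stats(frames, focused_scores)
```

With all-zero scores the function reduces to ordinary statistics pooling (unweighted mean and standard deviation per channel). With the focused scores, the two channels draw their statistics from different frames, which is the behaviour the abstract describes.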

Keywords

Computer science, Pooling, Artificial neural network, Speech recognition, Time delay neural network, Channel (broadcasting), Pattern recognition (psychology), Frame (networking), Context (archaeology), Artificial intelligence, Speaker recognition, Telecommunications

Publication Info

Year: 2020
Type: article
Citations: 1214
Access: Closed

Citation Metrics

1214 citations (OpenAlex)

Cite This

Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. https://doi.org/10.21437/interspeech.2020-2650

Identifiers

DOI: 10.21437/interspeech.2020-2650