Abstract

Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in automatic speech recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolutional neural networks and Transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way. To this end, we propose the convolution-augmented Transformer for speech recognition, named Conformer. Conformer significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy. On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test-clean/test-other. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
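
For context on how the convolution and self-attention pieces fit together, here is a minimal sketch of one Conformer block in PyTorch. The block structure (half-step feed-forward, self-attention, convolution module, half-step feed-forward, each with a residual connection) follows the paper, but every dimension, the kernel size, and the attention module here are illustrative assumptions; in particular, PyTorch's standard nn.MultiheadAttention stands in for the paper's relative positional self-attention, so this is a rough reconstruction, not the authors' implementation.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Half-step feed-forward module: LayerNorm -> expand -> Swish -> project."""
    def __init__(self, d_model, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                                  # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    """Convolution module: pointwise conv + GLU, depthwise conv, BatchNorm, Swish, pointwise conv."""
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)   # doubled channels feed the GLU gate
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):                      # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)       # conv layers expect (batch, channels, time)
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = nn.functional.silu(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForward(d_model, dropout=dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        # Standard absolute-position attention stands in here; the paper
        # uses relative positional self-attention (Transformer-XL style).
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvModule(d_model, dropout=dropout)
        self.ffn2 = FeedForward(d_model, dropout=dropout)
        self.final_norm = nn.LayerNorm(d_model)
    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)             # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # global context
        x = x + self.conv(x)                   # local features
        x = x + 0.5 * self.ffn2(x)             # second half-step feed-forward
        return self.final_norm(x)

x = torch.randn(2, 100, 256)                   # (batch, frames, features)
print(ConformerBlock()(x).shape)               # torch.Size([2, 100, 256])

Stacking such blocks on top of a convolutional subsampling frontend gives the encoder; the two half-weighted feed-forward layers sandwiching attention and convolution are the Macaron-style design the paper adopts.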

Keywords

Transformer, Computer science, Overlap–add method, Speech recognition, Convolution (computer science), Artificial intelligence, Natural language processing, Mathematics, Electrical engineering, Fourier transform, Engineering, Voltage, Artificial neural network

Publication Info

Year: 2020
Type: Article
Citations: 2442
Access: Closed

Citation Metrics

OpenAlex: 2442
Influential: 435
CrossRef: 1906

Cite This

Anmol Gulati, James Qin, Chung‐Cheng Chiu et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. Interspeech 2020. https://doi.org/10.21437/interspeech.2020-3015

Identifiers

DOI: 10.21437/interspeech.2020-3015
arXiv: 2005.08100
