Abstract

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
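The two masking steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `mask_features`, its parameters, the default widths, and the choice of zero as the fill value are all assumptions made for illustration, and the feature-warping step is left out entirely.

```python
import numpy as np


def mask_features(features, num_freq_masks=1, max_freq_width=8,
                  num_time_masks=1, max_time_width=20, rng=None):
    """Mask random blocks of frequency channels and time steps.

    `features` is assumed to be a 2-D array of shape (time_steps, channels),
    e.g. log filter bank coefficients. All names and default values here are
    hypothetical; the warping step of the policy is not implemented.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = features.copy()  # leave the caller's array untouched
    num_steps, num_channels = out.shape

    # Frequency masking: zero a block of consecutive channels.
    for _ in range(num_freq_masks):
        width = min(int(rng.integers(0, max_freq_width + 1)), num_channels)
        start = int(rng.integers(0, num_channels - width + 1))
        out[:, start:start + width] = 0.0  # fill value is an assumption

    # Time masking: zero a block of consecutive time steps.
    for _ in range(num_time_masks):
        width = min(int(rng.integers(0, max_time_width + 1)), num_steps)
        start = int(rng.integers(0, num_steps - width + 1))
        out[start:start + width, :] = 0.0  # fill value is an assumption

    return out


if __name__ == "__main__":
    # Stand-in for 100 frames of 80-channel filter bank features.
    feats = np.ones((100, 80), dtype=np.float32)
    augmented = mask_features(feats, rng=np.random.default_rng(0))
    print(augmented.shape, int((augmented == 0).sum()), "masked entries")
```

Because the mask positions and widths are drawn fresh on every call, each training pass would see a differently masked copy of the same utterance; passing a seeded generator makes the result reproducible.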

Keywords

Speech recognition; Computer science; Language model; Masking (illustration); Spell; Set (abstract data type); Feature (linguistics); Artificial neural network; Acoustic model; State (computer science); Artificial intelligence; Speech processing; Algorithm; Linguistics

Publication Info

Year
2019
Type
article
Pages
2613-2617
Citations
3338
Access
Closed

Cite This

Daniel Park, William Chan, Yu Zhang et al. (2019). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019, 2613-2617. https://doi.org/10.21437/interspeech.2019-2680

Identifiers

DOI
10.21437/interspeech.2019-2680