Pushing the Limits of Non-Autoregressive Speech Recognition
Abstract
We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
- Publication:
-
arXiv e-prints
- Pub Date:
- April 2021
- DOI:
- 10.48550/arXiv.2104.03416
- arXiv:
- arXiv:2104.03416
- Bibcode:
- 2021arXiv210403416N
- Keywords:
-
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Computation and Language;
- Computer Science - Machine Learning;
- Computer Science - Sound
- E-Print:
- Proceedings of INTERSPEECH