Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces
Abstract
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by excluding all the GMM bootstrapping, decision tree building and force alignment steps, while still achieving very competitive word-error-rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency since we can use larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which is similar to English as a fusional language, and Turkish, which is an agglutinative language.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2020
- DOI:
- 10.48550/arXiv.2005.09150
- arXiv:
- arXiv:2005.09150
- Bibcode:
- 2020arXiv200509150Z
- Keywords:
-
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Computation and Language
- E-Print:
- In proceedings Interspeech 2020