Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
Abstract
We present a novel conversational-context-aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context, word, and speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans sentences and is consequently better at recognizing long conversations. Specifically, we propose to use external text-based word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate through better conversational-context representation. We evaluate the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.
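The abstract does not spell out the gating equations, so the following is only a minimal sketch of one common form of gated fusion, not the authors' formulation. It assumes the three embeddings are concatenated, a sigmoid gate and a tanh candidate are computed from the concatenation, and the two are multiplied element-wise. The function name `gated_fusion`, the weight names, and all dimensions are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_fusion(context_emb, word_emb, speech_emb,
                 gate_weights, gate_bias, proj_weights, proj_bias):
    """Fuse three embeddings with an element-wise gate (assumed form).

    gate      = sigmoid(W_g [c; w; s] + b_g)
    candidate = tanh(W_p [c; w; s] + b_p)
    fused     = gate * candidate
    """
    concat = context_emb + word_emb + speech_emb  # list concatenation
    gate = [sigmoid(z + b)
            for z, b in zip(matvec(gate_weights, concat), gate_bias)]
    candidate = [math.tanh(z + b)
                 for z, b in zip(matvec(proj_weights, concat), proj_bias)]
    return [g * c for g, c in zip(gate, candidate)]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy dimensions, chosen only for illustration.
rng = random.Random(0)
context_emb = [rng.uniform(-1, 1) for _ in range(4)]  # e.g. a sentence embedding
word_emb = [rng.uniform(-1, 1) for _ in range(3)]     # e.g. a word embedding
speech_emb = [rng.uniform(-1, 1) for _ in range(5)]   # e.g. an acoustic encoder state
out_dim, in_dim = 3, 4 + 3 + 5

fused = gated_fusion(
    context_emb, word_emb, speech_emb,
    random_matrix(out_dim, in_dim, rng), [0.0] * out_dim,
    random_matrix(out_dim, in_dim, rng), [0.0] * out_dim,
)
print(fused)
```

In a design like this, the gate lets the model scale each dimension of the fused vector, so the contribution of the conversational context can be reduced when it is uninformative for the current utterance. In a real system the weights would be trained jointly with the recognizer rather than drawn at random.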
- Publication: arXiv e-prints
- Pub Date: June 2019
- DOI: 10.48550/arXiv.1906.11604
- arXiv: arXiv:1906.11604
- Bibcode: 2019arXiv190611604K
- Keywords: Computer Science - Computation and Language; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print: ACL 2019