Powerful and Extensible WFST Framework for RNN-Transducer Losses
Abstract
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications to the RNN-Transducer (RNN-T) loss. Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug. WFSTs are easy to construct and extend, and allow debugging through visualization. We introduce two WFST-powered RNN-T implementations: (1) "Compose-Transducer", based on a composition of the WFST graphs from acoustic and textual schema, which is computationally competitive and easy to modify; (2) "Grid-Transducer", which constructs the lattice directly for further computations and is the most compact and computationally efficient. We illustrate the ease of extensibility through the introduction of a new W-Transducer loss, an adaptation of Connectionist Temporal Classification with Wild Cards. W-Transducer (W-RNNT) consistently outperforms the standard RNN-T in a weakly supervised data setup with missing parts of transcriptions at the beginning and end of utterances. All RNN-T losses are implemented with the k2 framework and are available in the NeMo toolkit.
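To make the idea of a directly constructed lattice concrete, here is a minimal sketch in plain Python of the standard RNN-T lattice written as an explicit list of arcs, followed by a forward pass that sums over all alignment paths. It is an illustration under stated assumptions and not the k2 or NeMo implementation: the function names `build_grid_arcs` and `rnnt_neg_log_likelihood`, the `(t, u)` state naming, the single `"final"` state, and the `log_probs[t][u][v]` layout are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def build_grid_arcs(T, U):
    """List the arcs of an RNN-T lattice with T frames and U target labels.

    States are pairs (t, u). A "label" arc moves (t, u) -> (t, u + 1) and a
    "blank" arc moves (t, u) -> (t + 1, u). One extra blank arc leaves the
    last grid state for a hypothetical "final" state. Arcs are emitted in
    topological order of their source states.
    """
    arcs = []
    for t in range(T):
        for u in range(U + 1):
            if u < U:
                arcs.append(((t, u), (t, u + 1), "label", t, u))
            if t + 1 < T:
                arcs.append(((t, u), (t + 1, u), "blank", t, u))
    arcs.append(((T - 1, U), "final", "blank", T - 1, U))
    return arcs


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_neg_log_likelihood(log_probs, targets, blank=0):
    """Forward algorithm over the arc list.

    log_probs[t][u][v] is assumed to hold the log-probability of symbol v
    at frame t after u emitted labels (shape T x (U + 1) x V).
    """
    T, U = len(log_probs), len(targets)
    alpha = defaultdict(lambda: -math.inf)
    alpha[(0, 0)] = 0.0
    for src, dst, kind, t, u in build_grid_arcs(T, U):
        symbol = blank if kind == "blank" else targets[u]
        alpha[dst] = logaddexp(alpha[dst], alpha[src] + log_probs[t][u][symbol])
    return -alpha["final"]


# Toy check: T = 2 frames, U = 1 label, uniform distribution over 2 symbols.
# There are two paths of three arcs each, so the total probability is
# 2 * 0.5**3 = 0.25 and the loss equals log(4).
uniform = [[[math.log(0.5)] * 2 for _ in range(2)] for _ in range(2)]
print(rnnt_neg_log_likelihood(uniform, targets=[1]))
```

Because the lattice is an explicit arc list, a loss variant can in principle be expressed by adding or re-weighting arcs rather than rewriting the recursion, which is the kind of extensibility the abstract attributes to the WFST formulation.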
- Publication: arXiv e-prints
- Pub Date: March 2023
- DOI: 10.48550/arXiv.2303.10384
- arXiv: arXiv:2303.10384
- Bibcode: 2023arXiv230310384L
- Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Machine Learning; Computer Science - Sound
- E-Print: To appear in Proc. ICASSP 2023, June 04-10, 2023, Rhodes Island, Greece. 5 pages, 5 figures, 3 tables