Delay-penalized transducer for low-latency streaming ASR

doi:10.48550/arXiv.2211.00490

In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible with minimal impact on recognition accuracy. Although a few existing methods achieve this goal, they are difficult to implement because they depend on external alignments. In this paper, we propose a simple way to penalize symbol delay in the transducer model, so that the trade-off between symbol delay and accuracy can be balanced for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two-dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that the method significantly reduces symbol delay with an acceptable degradation in accuracy. Our method achieves a delay-accuracy trade-off similar to that of the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay. Our work is open-sourced and publicly available (https://github.com/k2-fsa/k2).
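
The penalty described in the abstract can be illustrated with a minimal sketch. It is not the k2 implementation: the function names `log_softmax` and `penalize_delay`, the nested-list layout `logits[t][u][k]`, the 0-based frame index t, and the choice of index 0 for the blank symbol are all assumptions made for illustration. The sketch normalizes each output distribution and then adds `lam * (T/2 - t)` to every non-blank log-probability; the transducer recursion that would consume these values is not shown.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def penalize_delay(logits, lam, blank_id=0):
    """Add lam * (T/2 - t) to the non-blank log-probabilities.

    logits[t][u][k] is the unnormalized score of symbol k at frame t and
    label position u (a hypothetical layout chosen for this sketch).
    lam is the small constant from the abstract; blank_id is assumed.
    """
    T = len(logits)
    penalized = []
    for t, frame in enumerate(logits):
        # Positive for early frames, negative for late ones.
        offset = lam * (T / 2 - t)
        out_frame = []
        for row in frame:
            log_probs = log_softmax(row)  # normalize first, then offset
            out_frame.append(
                [lp if k == blank_id else lp + offset
                 for k, lp in enumerate(log_probs)]
            )
        penalized.append(out_frame)
    return penalized
```

Because the offset decreases linearly with t, alignment paths that emit symbols at earlier frames receive a higher total score, which is the sense in which the term penalizes symbol delay. Blank log-probabilities are left unchanged.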

Publication:

arXiv e-prints

Pub Date:

October 2022

DOI:

10.48550/arXiv.2211.00490

arXiv:

arXiv:2211.00490

Bibcode:

2022arXiv221100490K

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Computation and Language;
Computer Science - Machine Learning;
Computer Science - Sound

E-Print:

Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing
