Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

doi:10.48550/arXiv.2007.09245

In this paper, we propose a streaming model that distinguishes voice queries intended for a smart-home device from background speech. The model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. Streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer, which forms the utterance-level prediction from the frames seen up to the current frame. To avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non-device-directed task show that the proposed model reduces the equal error rate by 41% compared to our previous best model on this task. Furthermore, we show that the proposed model makes accurate predictions earlier in time than attention-based models.
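As an illustration of the causal mean aggregation idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the aggregation is a plain running average of per-frame output vectors, so the prediction at frame t depends only on frames 0..t. The names `causal_mean` and `StreamingMean` are hypothetical, and the per-frame vectors stand in for whatever the LSTM stack would emit.

```python
import numpy as np


def causal_mean(frame_outputs):
    """Offline reference: row t is the mean of frame_outputs[0..t]."""
    frame_outputs = np.asarray(frame_outputs, dtype=np.float64)
    num_frames = frame_outputs.shape[0]
    # Frame counts 1..T, shaped to broadcast over any trailing feature axes.
    counts = np.arange(1, num_frames + 1, dtype=np.float64)
    counts = counts.reshape((-1,) + (1,) * (frame_outputs.ndim - 1))
    return np.cumsum(frame_outputs, axis=0) / counts


class StreamingMean:
    """Online version: keeps only a running sum and a frame count."""

    def __init__(self):
        self.total = None
        self.count = 0

    def update(self, frame_output):
        """Consume one frame and return the mean over all frames so far."""
        frame_output = np.asarray(frame_output, dtype=np.float64)
        if self.total is None:
            self.total = frame_output.copy()
        else:
            self.total = self.total + frame_output
        self.count += 1
        return self.total / self.count
```

Because the online version stores only a sum and a count, its memory use does not grow with utterance length, and feeding frames one at a time through `StreamingMean.update` yields the same values as the rows of `causal_mean` on the full sequence.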

Publication: arXiv e-prints

Pub Date: July 2020

DOI: 10.48550/arXiv.2007.09245

arXiv: arXiv:2007.09245

Bibcode: 2020arXiv200709245T

Keywords: Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound
