WakeUpNet: A Mobile-Transformer based Framework for End-to-End Streaming Voice Trigger

doi:10.48550/arXiv.2210.02904

End-to-end models have gradually become the mainstream approach to voice trigger, aiming for the highest possible prediction accuracy with a small footprint. In this paper, we propose an end-to-end voice trigger framework, WakeUpNet, built on a Transformer encoder. The purpose of this framework is to explore the context-capturing capability of the Transformer, since sequential information is vital for wakeup-word detection. However, the conventional Transformer encoder is too large for this task. To address this issue, we apply several model compression approaches to shrink the vanilla encoder into a tiny one, called mobile-Transformer. To evaluate the mobile-Transformer, we conduct extensive experiments on HiMia, a large publicly available dataset. The results indicate that the proposed mobile-Transformer significantly outperforms other frequently used voice trigger models in both clean and noisy scenarios.

Publication:

arXiv e-prints

Pub Date:

October 2022

DOI:

10.48550/arXiv.2210.02904

arXiv:

arXiv:2210.02904

Bibcode:

2022arXiv221002904Z

Keywords:

Computer Science - Sound;
Electrical Engineering and Systems Science - Audio and Speech Processing
