Joint Audio and Speech Understanding
Abstract
Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events - almost everything perceivable from audio signals.
- Publication:
-
arXiv e-prints
- Pub Date:
- September 2023
- DOI:
- 10.48550/arXiv.2309.14405
- arXiv:
- arXiv:2309.14405
- Bibcode:
- 2023arXiv230914405G
- Keywords:
-
- Computer Science - Sound;
- Computer Science - Artificial Intelligence;
- Electrical Engineering and Systems Science - Audio and Speech Processing
- E-Print:
- Accepted at ASRU 2023. Code, dataset, and pretrained models are at https://github.com/yuangongnd/ltu. Interactive demo at https://huggingface.co/spaces/yuangongfdu/ltu-2