A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

doi:10.48550/arXiv.2409.09914

This work investigates two strategies for zero-shot non-intrusive speech assessment using large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of the transcribed text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper by examining their correlations with human-based quality and intelligibility assessments and with the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is not effective for audio analysis, whereas GPT-Whisper achieves higher prediction accuracy, showing moderate correlation with speech quality and intelligibility and high correlation with CER. Compared to the supervised non-intrusive neural speech assessment models MOS-SSL and MTI-Net, GPT-Whisper yields a notably higher Spearman's rank correlation with the CER of Whisper. These findings validate GPT-Whisper as a reliable method for accurate zero-shot speech assessment that requires no additional training data (speech data and corresponding assessment scores).
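The evaluation described above rests on two standard quantities: the character error rate of a transcript and Spearman's rank correlation between predicted scores and a reference metric. The following is a minimal sketch of both in plain Python, not the authors' implementation. The function names and the example scores at the bottom are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical data: predicted naturalness scores and per-utterance CER.
    predicted_scores = [4.5, 3.0, 2.0, 1.5]
    cer_values = [0.02, 0.10, 0.35, 0.60]
    print(spearman(predicted_scores, cer_values))
```

A strongly negative coefficient in such a comparison would indicate that higher predicted scores go with lower CER. In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written version.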

Publication:

arXiv e-prints

Pub Date:

September 2024

DOI:

10.48550/arXiv.2409.09914

arXiv:

arXiv:2409.09914

Bibcode:

2024arXiv240909914Z

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Sound
