TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

doi:10.48550/arXiv.2409.07841

We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
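The abstract's point about turning regression into classification can be illustrated with a toy example. The sketch below is not the authors' implementation; it only shows the general idea of scoring predicted discrete tokens with a cross-entropy loss. The function names, the codebook size of 4, and the logit values are all hypothetical, chosen for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one frame's vocabulary scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def token_cross_entropy(frame_logits, target_tokens):
    """Mean negative log-likelihood of the target token ids, one per frame."""
    if len(frame_logits) != len(target_tokens):
        raise ValueError("need one target token per frame")
    total = 0.0
    for logits, token in zip(frame_logits, target_tokens):
        total -= log_softmax(logits)[token]
    return total / len(target_tokens)

# Hypothetical toy setup (assumption, not from the paper):
# 3 frames, a codebook of 4 discrete tokens.
frame_logits = [
    [2.0, 0.1, -1.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.2, 0.0],
]
target_tokens = [0, 2, 1]
loss = token_cross_entropy(frame_logits, target_tokens)
```

Here each frame is treated as a choice among a fixed set of token ids, so the model is trained to assign high probability to the correct id rather than to match a continuous waveform value directly.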

Publication:

arXiv e-prints

Pub Date:

September 2024

DOI:

10.48550/arXiv.2409.07841

arXiv:

arXiv:2409.07841

Bibcode:

2024arXiv240907841T

Keywords:

Computer Science - Sound;
Computer Science - Machine Learning;
Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

Submitted to ICASSP 2025
