Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?
Abstract
This paper proposes a zero-shot speech emotion recognition (SER) method that estimates emotions not previously defined in the SER model training. Conventional methods are limited to recognizing emotions defined by a single word. Moreover, we have the motivation to recognize unknown bipolar emotions such as ``I want to buy - I do not want to buy.'' In order to allow the model to define classes using sentences freely and to estimate unknown bipolar emotions, our proposed method expands upon the contrastive language-audio pre-training (CLAP) framework by introducing multi-class and multi-task settings. We also focus on purchase intention as a bipolar emotion and investigate the model's performance to zero-shot estimate it. This study is the first attempt to estimate purchase intention from speech directly. Experiments confirm that the results of zero-shot estimation by the proposed method are at the same level as those of the model trained by supervised learning.
- Publication:
-
arXiv e-prints
- Pub Date:
- October 2024
- DOI:
- 10.48550/arXiv.2410.09636
- arXiv:
- arXiv:2410.09636
- Bibcode:
- 2024arXiv241009636N
- Keywords:
-
- Electrical Engineering and Systems Science - Audio and Speech Processing;
- Computer Science - Artificial Intelligence;
- Computer Science - Machine Learning
- E-Print:
- 5 pages, 3 figures, accepted for APSIPA 2024 ASC