Responsive Listening Head Generation: A Benchmark Dataset and Baseline

doi:10.48550/arXiv.2112.13548

We present a new listening head generation benchmark for synthesizing the responsive feedback of a listener (e.g., nod, smile) during a face-to-face conversation. As the indispensable complement to talking head generation, listening head generation has seldom been studied in the literature. Automatically synthesizing listening behavior that actively responds to a talking head is critical to applications such as digital humans, virtual agents, and social robots. In this work, we propose a novel dataset, "ViCo", which focuses on listening head generation during a face-to-face conversation. ViCo involves a total of 92 identities (67 speakers and 76 listeners) and features 483 clips in a paired "speaking-listening" pattern, where listeners show three listening styles based on their attitudes: positive, neutral, and negative. Unlike traditional speech-to-gesture or talking head generation, listening head generation takes both the audio and visual signals from the speaker as input and gives non-verbal feedback (e.g., head motions, facial expressions) in real time. Our dataset supports a wide range of applications such as human-to-human interaction, video-to-video translation, and cross-modal understanding and generation. To encourage further research, we also release a listening head generation baseline conditioned on different listening attitudes. Code & ViCo dataset: https://project.mhzhou.com/vico.

Publication: arXiv e-prints
Pub Date: December 2021
DOI: 10.48550/arXiv.2112.13548
arXiv: arXiv:2112.13548
Bibcode: 2021arXiv211213548Z
Keywords: Computer Science - Computer Vision and Pattern Recognition
E-Print: Accepted by ECCV 2022
