ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

doi:10.48550/arXiv.2408.03284

Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.

Publication:

arXiv e-prints

Pub Date:

August 2024

DOI:

10.48550/arXiv.2408.03284

arXiv:

arXiv:2408.03284

Bibcode:

2024arXiv240803284G

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Graphics;
Computer Science - Multimedia

E-Print:

Accepted to European Conference on Computer Vision (ECCV), 2024. Project page: https://guanjz20.github.io/projects/ReSyncer
