Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Abstract
The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, directly inferring per-pixel intensity for the next visual frame is extremely challenging because of the high-dimensional image space. To address this, we decouple audio-visual conditioned video prediction into motion and appearance modeling. The multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from a motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to address the loss of global appearance context during long-term continuous warping. The global appearance context is extracted by the context encoder and manipulated by a motion-conditioned affine transformation before fusion with the features of the warped frames. Experimental results show that our method is competitive on existing benchmarks.
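The two appearance-side operations named in the abstract, warping a frame with predicted optical flow and applying an affine transformation to context features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `warp_frame` and `modulate_context`, the backward-warping convention, the bilinear sampling, and the per-channel scale/shift form are all assumptions made for illustration. In the paper the affine parameters are conditioned on motion; here they are simply passed in as hypothetical inputs `gamma` and `beta`.

```python
import math


def warp_frame(frame, flow):
    """Backward-warp a grayscale frame (list of rows) with a dense flow field.

    Assumed convention: flow[y][x] = (dx, dy), meaning output pixel (x, y)
    samples the input frame at (x + dx, y + dy). Sampling is bilinear and
    coordinates are clamped to the image borders.
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling location so it stays inside the frame.
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation between the four neighbouring pixels.
            top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
            bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
            out[y][x] = (1 - wy) * top + wy * bottom
    return out


def modulate_context(context, gamma, beta):
    """Channel-wise affine transform: gamma[c] * context[c] + beta[c].

    `gamma` and `beta` stand in for parameters that a network would
    predict from motion features (hypothetical inputs in this sketch).
    """
    return [g * c + b for c, g, b in zip(context, gamma, beta)]


if __name__ == "__main__":
    frame = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
    # Every output pixel samples one pixel to its right.
    shift_right = [[(1.0, 0.0)] * 3 for _ in range(2)]
    print(warp_frame(frame, shift_right))
    print(modulate_context([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```

Applying such a warp repeatedly over many predicted frames compounds interpolation and border-clamping errors, which is one plausible reading of why the abstract pairs warping with a separate refinement step that re-injects global appearance context.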
- Publication: arXiv e-prints
- Pub Date: December 2022
- DOI: 10.48550/arXiv.2212.04679
- arXiv: arXiv:2212.04679
- Bibcode: 2022arXiv221204679X
- Keywords: Computer Science - Computer Vision and Pattern Recognition
- E-Print: BMVC 2023