Learning in Audio-visual Context: A Review, Analysis, and New Perspective
Abstract
Sight and hearing are two senses that play a vital role in human communication and scene understanding. To mimic human perception ability, audio-visual learning, aimed at developing computational approaches to learn from both audio and visual modalities, has been a flourishing field in recent years. A comprehensive survey that can systematically organize and analyze studies of the audio-visual field is expected. Starting from the analysis of audio-visual cognition foundations, we introduce several key findings that have inspired our computational studies. Then, we systematically review the recent audio-visual learning studies and divide them into three categories: audio-visual boosting, cross-modal perception and audio-visual collaboration. Through our analysis, we discover that, the consistency of audio-visual data across semantic, spatial and temporal support the above studies. To revisit the current development of the audio-visual learning field from a more macro view, we further propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area. Overall, this survey reviews and outlooks the current audio-visual learning field from different aspects. We hope it can provide researchers with a better understanding of this area. A website including constantly-updated survey is released: \url{https://gewu-lab.github.io/audio-visual-learning/}.
- Publication:
-
arXiv e-prints
- Pub Date:
- August 2022
- DOI:
- 10.48550/arXiv.2208.09579
- arXiv:
- arXiv:2208.09579
- Bibcode:
- 2022arXiv220809579W
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Artificial Intelligence;
- I.2.10;
- I.4.8;
- I.5
- E-Print:
- https://gewu-lab.github.io/audio-visual-learning/