Mix and Localize: Localizing Sound Sources in Mixtures
Abstract
We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources, and to associate them with a visual signal. Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
- Publication:
-
arXiv e-prints
- Pub Date:
- November 2022
- DOI:
- 10.48550/arXiv.2211.15058
- arXiv:
- arXiv:2211.15058
- Bibcode:
- 2022arXiv221115058H
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition
- E-Print:
- CVPR 2022