Sound Localization by Self-Supervised Time Delay Estimation

doi:10.48550/arXiv.2204.12489

Sound Localization by Self-Supervised Time Delay Estimation

Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/

Publication:

arXiv e-prints

Pub Date:

April 2022

DOI:

10.48550/arXiv.2204.12489

arXiv:

arXiv:2204.12489

Bibcode:

2022arXiv220412489C

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Sound;
Electrical Engineering and Systems Science - Audio and Speech Processing

E-Print:

ECCV 2022

NASA/ADS

Sound Localization by Self-Supervised Time Delay Estimation

Abstract