Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
Abstract
Generic Boundary Detection (GBD) aims at locating the general boundaries that divide videos into semantically coherent and taxonomy-free units, and could serve as an important pre-processing step for long-form video understanding. Previous works often handle the different types of generic boundaries separately, with specific deep network designs ranging from simple CNNs to LSTMs. Instead, in this paper, we present Temporal Perceiver, a general Transformer-based architecture that offers a unified solution to the detection of arbitrary generic boundaries, from shot-level and event-level to scene-level. The core design is to introduce a small set of latent feature queries as anchors to compress the redundant video input into a fixed dimension via cross-attention blocks. Because the number of latent units is fixed, the quadratic complexity of the attention operation is reduced to linear in the number of input frames. Specifically, to explicitly leverage the temporal structure of videos, we construct two types of latent feature queries: boundary queries and context queries, which handle semantic incoherence and coherence, respectively. Moreover, to guide the learning of latent feature queries, we propose an alignment loss on the cross-attention maps to explicitly encourage the boundary queries to attend to the top boundary candidates. Finally, we present a sparse detection head on the compressed representation that directly outputs the final boundary detection results without any post-processing module. We test our Temporal Perceiver on a variety of GBD benchmarks. Our method obtains state-of-the-art results on all benchmarks with RGB single-stream features: SoccerNet-v2 (81.9% avg-mAP), Kinetics-GEBD (86.0% avg-F1), TAPOS (73.2% avg-F1), MovieScenes (51.9% AP and 53.1% mIoU) and MovieNet (53.3% AP and 53.2% mIoU), demonstrating the generalization ability of our Temporal Perceiver.
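The compression step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it omits learned projections, multiple heads, the boundary/context query split, and the alignment loss, and the function name `cross_attention_compress` and the sizes `N`, `K`, `D` are assumptions chosen for illustration. It only shows why a fixed set of K latent queries attending over N frame features yields a K × N attention map, whose cost grows linearly with N rather than quadratically.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_compress(latents, frames):
    """Hypothetical helper: compress N frame features into K latent units.

    latents: (K, D) latent feature queries (random here, learned in practice)
    frames:  (N, D) per-frame video features
    returns: (K, D) compressed representation and the (K, N) attention map
    """
    d = latents.shape[-1]
    scores = latents @ frames.T / np.sqrt(d)  # (K, N): cost is O(K * N * D)
    attn = softmax(scores, axis=-1)           # each latent attends over all frames
    return attn @ frames, attn

# Illustrative sizes (assumptions, not values from the paper).
rng = np.random.default_rng(0)
N, K, D = 1000, 16, 64
frames = rng.standard_normal((N, D))
latents = rng.standard_normal((K, D))

compressed, attn = cross_attention_compress(latents, frames)
print(compressed.shape, attn.shape)
```

With K held fixed, doubling the number of frames doubles the size of the attention map, whereas frame-to-frame self-attention would build an N × N map and quadruple it.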
- Publication: arXiv e-prints
- Pub Date: March 2022
- DOI: 10.48550/arXiv.2203.00307
- arXiv: arXiv:2203.00307
- Bibcode: 2022arXiv220300307T
- Keywords: Computer Science - Computer Vision and Pattern Recognition