SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture, and their memory footprint during pretraining can be substantial. We therefore pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and use the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, using no data augmentation beyond standard image augmentations, while remaining pretrainable on memory-limited devices.
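For readers unfamiliar with the baseline objective named above, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of paired video and text embeddings. It is a generic illustration, not the authors' implementation, and it does not reproduce EgoNCE, whose details the abstract does not give. The function names (`cosine`, `info_nce`), the plain-list embedding format, and the temperature value of 0.05 are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, text_embs, temperature=0.05):
    """Symmetric InfoNCE over a batch where video i is paired with text i.

    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    n = len(video_embs)
    # Temperature-scaled similarity matrix: rows are videos, columns are texts.
    sim = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]

    def row_loss(logits, pos):
        # Cross-entropy with the positive at index `pos`,
        # using a max-shifted log-sum-exp for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_denom - logits[pos]

    # Video-to-text direction: each row of the similarity matrix.
    v2t = sum(row_loss(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column of the similarity matrix.
    t2v = sum(row_loss([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(videos, aligned_texts))  # small: pairs match
    print(info_nce(videos, swapped_texts))  # large: pairs are mismatched
```

In this sketch every other item in the batch serves as a negative for a given pair. The loss is low when each video is most similar to its own text and high when it is more similar to another item's text.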

Publication: arXiv e-prints
Pub Date: June 2024
DOI: 10.48550/arXiv.2406.09462
arXiv: arXiv:2406.09462
Bibcode: 2024arXiv240609462V
Keywords: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence
