Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

doi:10.48550/arXiv.2407.10733

In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint-Embedding Predictive Architecture (JEPA) with MCA to capture intricate semantics and precise object boundaries. Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder. Using the transformer decoder as the predictor within the JEPA framework allows it to be trained effectively for universal image segmentation tasks. In evaluations on ADE20K, Cityscapes, and COCO, Mask-JEPA achieves competitive results and shows adaptability and robustness across various training scenarios. Because Mask-JEPA is architecture-agnostic, it can be adapted to various members of the mask classification family.
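For readers unfamiliar with joint-embedding predictive training, the general idea can be sketched as follows. This is a minimal toy illustration and not the authors' implementation: the abstract does not specify the loss, the masking scheme, or how the target branch is updated. The function names `jepa_loss` and `ema_update`, the linear stand-ins for the encoders and predictor, the mean pooling, and the exponential-moving-average target update are all assumptions made for illustration. In Mask-JEPA the predictor role is played by the transformer decoder, which is far richer than the single matrix used here.

```python
import numpy as np


def jepa_loss(x, mask, w_ctx, w_tgt, w_pred):
    """Toy JEPA-style loss (illustrative only, not the Mask-JEPA implementation).

    x:      (n_patches, d_in) patch features
    mask:   (n_patches,) boolean, True where a patch is hidden from the context branch
    w_ctx, w_tgt: (d_in, d_emb) linear stand-ins for the context and target encoders
    w_pred: (d_emb, d_emb) linear stand-in for the predictor
    """
    # Context branch: encode only the visible patches and pool them into one summary.
    context = (x[~mask] @ w_ctx).mean(axis=0)
    # Predictor: map the context summary to a predicted embedding for the masked patches.
    predicted = context @ w_pred
    # Target branch: encode the masked patches (treated as constants during training).
    targets = x[mask] @ w_tgt
    # The regression happens in embedding space, not pixel space.
    return float(np.mean((targets - predicted) ** 2))


def ema_update(w_tgt, w_ctx, momentum=0.99):
    """Assumed target-encoder update: exponential moving average of the context weights."""
    return momentum * w_tgt + (1.0 - momentum) * w_ctx
```

In this sketch, the loss is computed between predicted and target embeddings rather than reconstructed pixels, which is the property that distinguishes joint-embedding predictive objectives from pixel-reconstruction objectives.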

Publication: arXiv e-prints
Pub Date: July 2024
DOI: 10.48550/arXiv.2407.10733
arXiv: arXiv:2407.10733
Bibcode: 2024arXiv240710733K
Keywords: Computer Science - Computer Vision and Pattern Recognition
E-Print: 27 pages, 5 figures
