PaliGemma: A versatile 3B VLM for transfer

doi:10.48550/arXiv.2407.07726

Beyer, Lucas
Steiner, Andreas
Susano Pinto, André
Kolesnikov, Alexander
Wang, Xiao
Salz, Daniel
Neumann, Maxim
Alabdulmohsin, Ibrahim
Tschannen, Michael
Bugliarello, Emanuele
Unterthiner, Thomas
Keysers, Daniel
Koppula, Skanda
Liu, Fangyu
Grycner, Adam
Gritsenko, Alexey
Houlsby, Neil
Kumar, Manoj
Rong, Keran
Eisenschlos, Julian
Kabra, Rishabh
Bauer, Matthias
Bošnjak, Matko
Chen, Xi
Minderer, Matthias
Voigtlaender, Paul
Bica, Ioana
Balazevic, Ivana
Puigcerver, Joan
Papalampidi, Pinelopi
Henaff, Olivier
Xiong, Xi
Soricut, Radu
Harmsen, Jeremiah
Zhai, Xiaohua

PaliGemma is an open Vision-Language Model (VLM) based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that transfers effectively. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.

Publication:

arXiv e-prints

Pub Date:

July 2024

DOI:

10.48550/arXiv.2407.07726

arXiv:

arXiv:2407.07726

Bibcode:

2024arXiv240707726B

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Artificial Intelligence;
Computer Science - Computation and Language;
Computer Science - Machine Learning

E-Print:

v2 adds Appendix H and I and a few citations

