Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations
Abstract
Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured objective that reasons over the intermodal alignment. We show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. Finally, we investigate the reasoning abilities of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.
- Publication:
-
arXiv e-prints
- Pub Date:
- February 2024
- DOI:
- 10.48550/arXiv.2403.07887
- arXiv:
- arXiv:2403.07887
- Bibcode:
- 2024arXiv240307887D
- Keywords:
-
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Artificial Intelligence