Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations

doi:10.48550/arXiv.2403.07887


Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured objective that reasons over the intermodal alignment. We show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. Finally, we investigate the reasoning abilities of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.

Publication:

arXiv e-prints

Pub Date:

February 2024

DOI:

10.48550/arXiv.2403.07887

arXiv:

arXiv:2403.07887

Bibcode:

2024arXiv240307887D

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Artificial Intelligence

