Toy Models of Superposition
Abstract
Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity" which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
- Publication: arXiv e-prints
- Pub Date: September 2022
- DOI: 10.48550/arXiv.2209.10652
- arXiv: arXiv:2209.10652
- Bibcode: 2022arXiv220910652E
- Keywords: Computer Science - Machine Learning
- E-Print: Also available at https://transformer-circuits.pub/2022/toy_model/index.html