Toy Models of Superposition

doi:10.48550/arXiv.2209.10652

Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity" that makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition". We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Publication: arXiv e-prints

Pub Date: September 2022

DOI: 10.48550/arXiv.2209.10652

arXiv: arXiv:2209.10652

Bibcode: 2022arXiv220910652E

Keywords: Computer Science - Machine Learning

E-Print: Also available at https://transformer-circuits.pub/2022/toy_model/index.html
