Discovering Variable Binding Circuitry with Desiderata

doi:10.48550/arXiv.2307.03637

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask solely by specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
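
The idea of testing components against a causal desideratum can be illustrated with a deliberately tiny sketch. This is not the paper's procedure or code: the three "components", the additive toy model, and the names `run`, `satisfies_desideratum`, `bind_value`, `add_operand`, and `bias` are all hypothetical, and the search is a brute-force check of one component at a time rather than anything applied to LLaMA-13B. It only shows the shape of a causal mediation test: patch a component with its output from a counterfactual input and ask whether the model's output changes in the way the desideratum predicts.

```python
from typing import Callable, Dict, Iterable, List, Optional

Input = Dict[str, object]
Component = Callable[[Input], float]


def run(
    components: List[Component],
    x: Input,
    patch_from: Optional[Input] = None,
    patched: Iterable[int] = (),
) -> float:
    """Sum component outputs (a toy stand-in for a residual stream).

    Components whose index is in `patched` are evaluated on `patch_from`
    instead of `x`, which mimics patching in counterfactual activations.
    """
    patched_set = set(patched)
    total = 0.0
    for i, component in enumerate(components):
        use_patch = patch_from is not None and i in patched_set
        total += component(patch_from if use_patch else x)
    return total


# Hypothetical toy components (assumptions for illustration only).
def bind_value(inp: Input) -> float:
    # Retrieves the value bound to the queried variable.
    return float(inp["bindings"][inp["query"]])


def add_operand(inp: Input) -> float:
    # Contributes the arithmetic operand, independent of the binding.
    return float(inp["operand"])


def bias(inp: Input) -> float:
    # Constant contribution that ignores the input entirely.
    return 1.0


def satisfies_desideratum(
    components: List[Component], idx: int, original: Input, counterfactual: Input
) -> bool:
    """Desideratum: patching component `idx` alone moves the output from the
    original answer to the counterfactual answer."""
    base = run(components, original)
    target = run(components, counterfactual)
    patched_out = run(components, original, counterfactual, [idx])
    return patched_out == target and patched_out != base


components = [bind_value, add_operand, bias]
original = {"bindings": {"x": 3, "y": 5}, "query": "x", "operand": 10}
# The counterfactual differs only in the value bound to x.
counterfactual = {"bindings": {"x": 7, "y": 5}, "query": "x", "operand": 10}

binding_components = [
    i
    for i in range(len(components))
    if satisfies_desideratum(components, i, original, counterfactual)
]
print(binding_components)
```

In this toy setting only the component that reads the variable's value satisfies the desideratum, because the counterfactual input changes nothing else. A real model has far more components and interactions between them, so an exhaustive one-at-a-time check like this one should be read as an illustration of the test, not as a practical search method.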

Publication: arXiv e-prints

Pub Date: July 2023

DOI: 10.48550/arXiv.2307.03637

arXiv: arXiv:2307.03637

Bibcode: 2023arXiv230703637D

Keywords: Computer Science - Artificial Intelligence
