Unveiling Language Skills via Path-Level Circuit Discovery
Abstract
Circuit discovery with edge-level ablation has become a foundational framework for mechanistic interpretability of language models. However, its focus on individual edges often overlooks the sequential, path-level causal relationships that underpin complex behaviors, potentially leading to misleading or incomplete circuit discoveries. To address this issue, we propose a novel path-level circuit discovery framework that captures how behaviors emerge through interconnected linear chains and build toward complex behaviors. Our framework is constructed upon a fully disentangled linear combination of "memory circuits" decomposed from the original model. To discover functional circuit paths, we use a two-step pruning strategy: we first reduce the computational graph to a faithful and minimal subgraph, and then apply causal mediation to identify the paths common to a specific skill, termed skill paths. In contrast to the circuit graphs of existing work, we focus on the complete paths of a generic skill rather than on fine-grained responses to individual components of the input. To demonstrate this, we explore three generic language skills, namely the Previous Token Skill, the Induction Skill, and the In-Context Learning Skill, using our framework and provide more compelling evidence to substantiate the stratification and inclusiveness of these skills.
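As a rough illustration of the two-step idea described in the abstract, the following toy sketch prunes a small weighted graph per input sample and then keeps only the paths that survive on every sample. Everything here is an assumption made for illustration: the graph encoding, the function names (`paths`, `output`, `prune`, `skill_paths`), the tolerance `tol`, and the use of simple edge removal as a crude stand-in for the paper's ablation and causal mediation steps. It is not the authors' implementation.

```python
# Toy sketch only: graphs are dicts mapping node -> {child: weight}.
# The model is assumed linear, so the output is a sum over paths.

def paths(graph, src, dst):
    """Enumerate all src->dst paths in a DAG as tuples of nodes."""
    if src == dst:
        return [(dst,)]
    return [(src,) + rest
            for child in graph.get(src, {})
            for rest in paths(graph, child, dst)]

def output(graph, src, dst):
    """Sum over all src->dst paths of the product of edge weights."""
    total = 0.0
    for p in paths(graph, src, dst):
        w = 1.0
        for a, b in zip(p, p[1:]):
            w *= graph[a][b]
        total += w
    return total

def prune(graph, src, dst, tol=0.05):
    """Step 1 (hypothetical): drop edges whose removal keeps the output
    within tol of the full graph's output."""
    g = {u: dict(vs) for u, vs in graph.items()}
    full = output(g, src, dst)
    for u in list(g):
        for v in list(g[u]):
            w = g[u].pop(v)
            if abs(output(g, src, dst) - full) > tol:
                g[u][v] = w  # the edge matters, so restore it
    return g

def skill_paths(graphs, src, dst, tol=0.05):
    """Step 2 (hypothetical): keep paths that survive pruning on every sample."""
    surviving = [set(paths(prune(g, src, dst, tol), src, dst)) for g in graphs]
    return set.intersection(*surviving)

# Two made-up samples that share the path in -> a -> out.
g1 = {"in": {"a": 1.0, "b": 0.01}, "a": {"out": 1.0}, "b": {"out": 1.0}}
g2 = {"in": {"a": 0.9, "c": 0.5}, "a": {"out": 1.0}, "c": {"out": 1.0}}
common = skill_paths([g1, g2], "in", "out")
print(common)
```

In this toy setting, the weak `in -> b` branch is pruned from the first sample, and the `in -> c` branch appears only in the second, so only the shared path remains after the intersection.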
- Publication: arXiv e-prints
- Pub Date: October 2024
- arXiv: arXiv:2410.01334
- Bibcode: 2024arXiv241001334C
- Keywords: Computer Science - Computation and Language; Computer Science - Artificial Intelligence
- E-Print: 30 pages