Unveiling Language Skills via Path-Level Circuit Discovery

doi:10.48550/arXiv.2410.01334

Unveiling Language Skills via Path-Level Circuit Discovery

Circuit discovery with edge-level ablation has become a foundational framework for mechanism interpretability of language models. However, its focus on individual edges often overlooks the sequential, path-level causal relationships that underpin complex behaviors, thus potentially leading to misleading or incomplete circuit discoveries. To address this issue, we propose a novel path-level circuit discovery framework capturing how behaviors emerge through interconnected linear chain and build towards complex behaviors. Our framework is constructed upon a fully-disentangled linear combinations of ``memory circuits'' decomposed from the original model. To discover functional circuit paths, we leverage a 2-step pruning strategy by first reducing the computational graph to a faithful and minimal subgraph and then applying causal mediation to identify common paths of a specific skill, termed as skill paths. In contrast to circuit graph from existing works, we focus on the complete paths of a generic skill rather than on the fine-grained responses to individual components of the input. To demonstrate this, we explore three generic language skills, namely Previous Token Skill, Induction Skill and In-Context Learning Skill using our framework and provide more compelling evidence to substantiate stratification and inclusiveness of these skills.

Publication:

arXiv e-prints

Pub Date:

October 2024

DOI:

10.48550/arXiv.2410.01334

arXiv:

arXiv:2410.01334

Bibcode:

2024arXiv241001334C

Keywords:

Computer Science - Computation and Language;
Computer Science - Artificial Intelligence

E-Print:

30 pages

ADS

Unveiling Language Skills via Path-Level Circuit Discovery

Abstract