Interpreting Pretrained Language Models via Concept Bottlenecks

doi:10.48550/arXiv.2311.05014

Interpreting Pretrained Language Models via Concept Bottlenecks

Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their ``black-box'' nature poses challenges for responsible implementation. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of ``Food'' and investigate how it influences the prediction of a model's sentiment towards a restaurant review. We introduce C$^3$M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we manifest that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.

Publication:

arXiv e-prints

Pub Date:

November 2023

DOI:

10.48550/arXiv.2311.05014

arXiv:

arXiv:2311.05014

Bibcode:

2023arXiv231105014T

Keywords:

Computer Science - Computation and Language;
Computer Science - Artificial Intelligence

NASA/ADS

Interpreting Pretrained Language Models via Concept Bottlenecks

Abstract