Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

doi:10.48550/arXiv.2309.00751

Because language models are prone to generating toxic or hateful responses, several techniques have been developed to align model generations with users' preferences. Although these methods are effective at improving the safety of model interactions, their impact on models' internal processes is still poorly understood. In this work, we apply popular detoxification approaches to several language models and use feature attribution methods to quantify their impact on the resulting models' prompt dependence. We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performance.

Publication: arXiv e-prints

Pub Date: September 2023

DOI: 10.48550/arXiv.2309.00751

arXiv: arXiv:2309.00751

Bibcode: 2023arXiv230900751S

Keywords: Computer Science - Computation and Language

E-Print: 4 pages
