Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
Abstract
Deliberately attacking large language models (LLMs) to generate abnormal outputs is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We connect this activity to its practitioners' motivations and goals, the strategies and techniques they deploy, and the crucial role the community plays. As a result, this paper presents a grounded theory of how and why people attack large language models: LLM red teaming in the wild.
- Publication: arXiv e-prints
- Pub Date: November 2023
- DOI: 10.48550/arXiv.2311.06237
- arXiv: arXiv:2311.06237
- Bibcode: 2023arXiv231106237I
- Keywords: Computer Science - Computation and Language; Computer Science - Cryptography and Security; Computer Science - Human-Computer Interaction