Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions
Abstract
We introduce a new type of indirect injection attacks against language models that operate on images: hidden ''meta-instructions'' that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. In contrast to jailbreaking attacks and adversarial examples, outputs produced in response to these images are plausible and based on the visual content of the image, yet also satisfy the adversary's (meta-)objective. We evaluate the efficacy of meta-instructions for multiple visual language models and adversarial meta-objectives, and demonstrate how they can ''unlock'' capabilities of the underlying language models that are unavailable via explicit text instructions. We describe how meta-instruction attacks could cause harm by enabling creation of malicious, self-interpreting content that carries spam, misinformation, and spin. Finally, we discuss defenses.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2024
- DOI:
- 10.48550/arXiv.2407.08970
- arXiv:
- arXiv:2407.08970
- Bibcode:
- 2024arXiv240708970Z
- Keywords:
-
- Computer Science - Cryptography and Security;
- Computer Science - Artificial Intelligence;
- Computer Science - Machine Learning