Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions

doi:10.48550/arXiv.2407.08970

We introduce a new type of indirect injection attack against language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. In contrast to jailbreaking attacks and adversarial examples, the outputs produced in response to these images are plausible and based on the visual content of the image, yet also satisfy the adversary's (meta-)objective. We evaluate the efficacy of meta-instructions for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" capabilities of the underlying language models that are unavailable via explicit text instructions. We describe how meta-instruction attacks could cause harm by enabling the creation of malicious, self-interpreting content that carries spam, misinformation, and spin. Finally, we discuss defenses.

Publication:

arXiv e-prints

Pub Date:

July 2024

DOI:

10.48550/arXiv.2407.08970

arXiv:

arXiv:2407.08970

Bibcode:

2024arXiv240708970Z

Keywords:

Computer Science - Cryptography and Security;
Computer Science - Artificial Intelligence;
Computer Science - Machine Learning
