Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focuses on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search. Nevertheless, this method lacks generality since it specifies the instruction-response structure. Moreover, the reason why inserting special tokens takes effect in inducing harmful behaviors is only empirically discussed. In this paper, we take a deeper insight into the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ) facilitated with the demo-level greedy search. This framework decomposes the FSJ attack into pattern and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct elaborate experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.

Publication:

arXiv e-prints

Pub Date:

January 2025

arXiv:

arXiv:2501.07959

Bibcode:

2025arXiv250107959H

Keywords:

Computer Science - Artificial Intelligence

ADS

Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Abstract