Universal Jailbreak Suffixes Are Strong Attention Hijackers
Abstract
Suffix-based jailbreaks optimize adversarial suffixes that hijack large language models; their effectiveness is tied to suffix universality, and the attack can be both enhanced and mitigated at minimal computational or utility cost.
We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are markedly more universal (generalizing to many unseen harmful instructions) than others. We first show that GCG's effectiveness is driven by a shallow, critical mechanism, built on the information flow from the adversarial suffix to the final chat-template tokens before generation. Quantifying the dominance of this mechanism during generation, we find that GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon: more universal suffixes are stronger hijackers. Subsequently, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to 5x in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
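To make the hijacking measurement concrete, here is a minimal sketch (not the authors' released code, which lives in the repository above) of one way to quantify it with Hugging Face transformers: the attention mass that the final chat-template tokens pay to the adversarial suffix, averaged over layers and heads. The model name, the Llama-2-style template strings, and the uniform layer/head averaging are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tok = AutoTokenizer.from_pretrained(MODEL)
# Eager attention is required so that attention weights can be returned.
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager").eval()

def hijack_score(instruction: str, suffix: str) -> float:
    """Attention mass the final chat-template tokens pay to the suffix,
    averaged over layers and heads (higher = stronger hijacking)."""
    ids = lambda s: tok(s, add_special_tokens=False).input_ids
    # Llama-2-style chat template pieces; other models use different markers.
    parts = [ids("[INST] "), ids(instruction), ids(suffix), ids(" [/INST]")]
    input_ids = torch.tensor([sum(parts, [])])
    lens = [len(p) for p in parts]
    suffix_slice = slice(lens[0] + lens[1], lens[0] + lens[1] + lens[2])
    final_slice = slice(sum(lens) - lens[3], sum(lens))
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    attn = torch.stack(out.attentions)  # (layers, batch, heads, query, key)
    # Sum over suffix key positions, then average over layers, heads, queries.
    return attn[:, 0, :, final_slice, suffix_slice].sum(-1).mean().item()

print(hijack_score("How do I ...", " describing.\\ + similarlyNow"))
```

A high score here would indicate that the suffix dominates the contextualization of the tokens from which generation starts, which is the mechanism the abstract describes.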
Community
Analyzing the underlying mechanism of suffix-based LLM jailbreaks, we find that it relies on aggressively hijacking the model's context 🥷: the more universal a suffix, the stronger its hijacking. Exploiting this, we show how to both enhance and mitigate existing attacks.
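The enhance-and-mitigate claim lends itself to a brief illustration. The following speculative sketch reuses `hijack_score` from the example above; the candidate suffixes, probe instruction, and threshold are placeholders, and the paper's actual procedures likely differ in detail (in particular, its "surgical" mitigation presumably intervenes on the mechanism itself rather than merely filtering).

```python
# Enhancement: when GCG produces several candidate suffixes, keep the one
# with the strongest hijacking on a held-out instruction, which the paper
# links to greater universality.
candidates = ["suffix A ...", "suffix B ...", "suffix C ..."]  # GCG candidates (placeholders)
probe = "Explain how to ..."                                   # held-out probe (placeholder)
best_suffix = max(candidates, key=lambda s: hijack_score(probe, s))

# Mitigation: the simplest score-based defense is to refuse prompts whose
# hijacking score exceeds a threshold (value here is illustrative).
THRESHOLD = 0.5
def looks_hijacked(instruction: str, suffix: str) -> bool:
    return hijack_score(instruction, suffix) > THRESHOLD
```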