Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
Abstract
The safety alignment of large language models (LLMs) remains vulnerable: their initial behavior can be jailbroken by even relatively simple attacks. Since existing LLMs commonly infill a fixed template between the input instruction and the model's initial output, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making relies too heavily on information aggregated in the template region, which largely shapes these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across aligned LLMs. Our mechanistic analyses show how it leaves models susceptible to inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is a promising way to mitigate these vulnerabilities. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
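To make the "template region" concrete, here is a minimal sketch (assuming the Hugging Face transformers library, with Qwen/Qwen2-7B-Instruct used purely as an illustrative aligned chat model) that renders a chat template and shows the fixed tokens infilled between the user's instruction and the start of the model's response.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and
# "Qwen/Qwen2-7B-Instruct" purely as an example of an aligned chat model;
# any chat-tuned model with a chat template illustrates the same point.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# Placeholder instruction; in the paper's setting this would be a harmful request.
messages = [{"role": "user", "content": "<potentially harmful instruction>"}]

# add_generation_prompt=True appends the fixed tokens that sit between the end
# of the user's instruction and the model's first generated token. These fixed
# positions are the "template region" referred to in the abstract.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# For Qwen2-Instruct the rendered prompt ends with roughly:
#   <|im_start|>user\n<potentially harmful instruction><|im_end|>\n<|im_start|>assistant\n
# i.e., every query, harmful or benign, is followed by the same fixed suffix
# before generation begins.
```

The hypothesis above is that the information aggregated at these fixed template positions, rather than at the instruction tokens themselves, dominates the model's refuse-or-comply decision.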
Community
TL;DR: LLMs' safety vulnerabilities stem from over-reliance on the fixed chat template infilled between the instruction and the response, which exposes them to jailbreak attacks; robustness can be improved by detaching safety mechanisms from this template region.
The following similar papers were recommended by the Semantic Scholar API:
- Safety Alignment Depth in Large Language Models: A Markov Chain Perspective (2025)
- Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing (2025)
- Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense (2025)
- JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation (2025)
- Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions (2025)
- Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models (2025)
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models (2025)