TraceAlign -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
Abstract
TraceAlign is a framework that identifies and mitigates alignment drift in LLMs by tracing unsafe completions to their training sources and applying interventions to reduce drift while maintaining utility.
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has characterized alignment failure behaviorally, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies the semantic inconsistency between a generated span and the aligned policy, grounded in training documents retrieved via suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions containing high-BCI spans; (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective that penalizes high-BCI continuations during DPO; and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (Δ < 0.2) and improving refusal quality. We further derive a theoretical upper bound on drift likelihood from suffix-array span statistics, linking memorization frequency and span length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
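As a rough illustration of the pipeline the abstract describes, the sketch below scores generated spans with a toy BCI and gates completions the way an inference-time filter such as TraceShield would. The function names, the cosine-times-grounding heuristic, and the 0.7 threshold are illustrative assumptions, not the released implementation; at corpus scale the paper's suffix-array index would replace the linear substring scan used here.

```python
# Minimal sketch of span-level BCI scoring and a TraceShield-style refusal gate.
# All names, the threshold, and the scoring heuristic are illustrative assumptions.
from typing import Callable, List, Tuple

import numpy as np

BCI_THRESHOLD = 0.7  # assumed cutoff; in practice tuned on a held-out drift benchmark


def compute_bci(
    span: str,
    retrieved_docs: List[str],
    policy_vec: np.ndarray,
    embed: Callable[[str], np.ndarray],
) -> float:
    """Toy Belief Conflict Index: semantic disagreement between a generated span
    and the aligned policy, weighted by how strongly the span is grounded in
    retrieved training documents (crude verbatim-match evidence here; a
    suffix-array index would replace this scan at corpus scale)."""
    span_vec = embed(span)
    cosine = float(
        span_vec @ policy_vec
        / (np.linalg.norm(span_vec) * np.linalg.norm(policy_vec) + 1e-8)
    )
    conflict = 1.0 - cosine  # 0 = consistent with the policy, 2 = maximally conflicting
    grounding = sum(span in doc for doc in retrieved_docs) / max(len(retrieved_docs), 1)
    return conflict * grounding


def traceshield_filter(
    spans: List[str],
    retrieved_docs: List[str],
    policy_vec: np.ndarray,
    embed: Callable[[str], np.ndarray],
) -> Tuple[bool, List[float]]:
    """Return (is_safe, per-span scores); refuse the completion if any span's
    BCI exceeds the threshold, mirroring the inference-time filter described above."""
    scores = [compute_bci(s, retrieved_docs, policy_vec, embed) for s in spans]
    return all(s < BCI_THRESHOLD for s in scores), scores
```

A provenance-aware decoder in the spirit of Prov-Decode would apply the same scorer one step earlier, vetoing candidate beam expansions whose predicted spans exceed the threshold rather than refusing after generation.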
Community
AlignGuard-LoRA: a principled framework that structurally decomposes LoRA fine-tuning updates into alignment-critical and task-specific components using Fisher information and geodesic constraints, preserving alignment with minimal utility loss.
➡️ Key Highlights of AlignGuard-LoRA:
🧪 Fisher-Guided Decomposition:
LoRA updates are orthogonally decomposed into alignment-critical (ΔW_A) and task-specific (ΔW_T) components via an eigendecomposition of the Fisher Information Matrix (FIM), enabling selective regularization along high-curvature, safety-sensitive directions (a minimal numerical sketch follows this list).
🧩 Collision-Aware Regularization (RO + Geo):
Introduces dual-mode regularization using local Riemannian overlap and global geodesic separation between ΔW_A and ΔW_T to prevent update interference and latent entanglement, ensuring structural disentanglement of safety and utility.
🧠 DRIFTCHECK & Forgetting Scaling Law:
Defines a new benchmark, DRIFTCHECK, for evaluating alignment drift, and validates a modified scaling law characterizing catastrophic forgetting in alignment-sensitive subspaces, showing up to 50% improvement in alignment retention without task degradation.
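As referenced in the decomposition item above, here is a minimal numerical sketch of one plausible reading of the Fisher-guided split: project the flattened LoRA update onto the top-k eigenvectors of the FIM to obtain ΔW_A, with the orthogonal remainder as ΔW_T. The function name, the toy random FIM, and the choice of k are assumptions for illustration, not AlignGuard-LoRA's code.

```python
# Illustrative Fisher-guided decomposition of a LoRA update into alignment-critical
# and task-specific parts; a plausible reading of the description above, not the
# paper's reference implementation.
import numpy as np


def fisher_guided_split(delta_w: np.ndarray, fim: np.ndarray, k: int):
    """Project a flattened update onto the span of the top-k Fisher eigenvectors
    (high-curvature, safety-sensitive directions) to obtain the alignment-critical
    component ΔW_A; the orthogonal remainder is the task-specific component ΔW_T."""
    eigvals, eigvecs = np.linalg.eigh(fim)            # FIM is symmetric PSD
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k curvature directions
    p_align = top @ top.T                             # orthogonal projector onto them
    delta_w_a = p_align @ delta_w                     # alignment-critical part
    delta_w_t = delta_w - delta_w_a                   # task-specific part
    return delta_w_a, delta_w_t


# Toy example: split an 8-dimensional update along the 2 highest-curvature directions.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 8))
fim = a @ a.T                                         # random symmetric PSD "Fisher"
dw_a, dw_t = fisher_guided_split(rng.normal(size=8), fim, k=2)
assert abs(dw_a @ dw_t) < 1e-8                        # components are orthogonal
```

The collision-aware regularizers described in the second item would then act on these two components separately, penalizing overlap between them rather than the raw update.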
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Probing the Robustness of Large Language Models Safety to Latent Perturbations (2025)
- AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning (2025)
- Automating Steering for Safe Multimodal Large Language Models (2025)
- QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA (2025)
- Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs (2025)
- MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning (2025)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs (2025)