TraceAlign -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
Abstract
TraceAlign is a framework that identifies and mitigates alignment drift in LLMs by tracing unsafe completions to their training sources and applying interventions to reduce drift while maintaining utility.
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has characterized alignment failure behaviorally, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies the semantic inconsistency between a generated span and the aligned policy, grounded in training documents retrieved via suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions containing high-BCI spans; (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective that penalizes high-BCI continuations during DPO; and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (Δ < 0.2) and improving refusal quality. We further derive a theoretical upper bound on drift likelihood from suffix-array span statistics, linking memorization frequency and span length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
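As a rough illustration of the pipeline the abstract describes, the sketch below scores generated spans with a toy BCI and gates completions the way an inference-time filter such as TraceShield would. The function names, the cosine-times-grounding heuristic, and the 0.7 threshold are illustrative assumptions, not the released implementation; at corpus scale the paper's suffix-array index would replace the linear substring scan used here.

```python
# Minimal sketch of span-level BCI scoring and a TraceShield-style refusal gate.
# All names, the threshold, and the scoring heuristic are illustrative assumptions.
from typing import Callable, List, Tuple

import numpy as np

BCI_THRESHOLD = 0.7  # assumed cutoff; in practice tuned on a held-out drift benchmark


def compute_bci(
    span: str,
    retrieved_docs: List[str],
    policy_vec: np.ndarray,
    embed: Callable[[str], np.ndarray],
) -> float:
    """Toy Belief Conflict Index: semantic disagreement between a generated span
    and the aligned policy, weighted by how strongly the span is grounded in
    retrieved training documents (crude verbatim-match evidence here; a
    suffix-array index would replace this scan at corpus scale)."""
    span_vec = embed(span)
    cosine = float(
        span_vec @ policy_vec
        / (np.linalg.norm(span_vec) * np.linalg.norm(policy_vec) + 1e-8)
    )
    conflict = 1.0 - cosine  # 0 = consistent with the policy, 2 = maximally conflicting
    grounding = sum(span in doc for doc in retrieved_docs) / max(len(retrieved_docs), 1)
    return conflict * grounding


def traceshield_filter(
    spans: List[str],
    retrieved_docs: List[str],
    policy_vec: np.ndarray,
    embed: Callable[[str], np.ndarray],
) -> Tuple[bool, List[float]]:
    """Return (is_safe, per-span scores); refuse the completion if any span's
    BCI exceeds the threshold, mirroring the inference-time filter described above."""
    scores = [compute_bci(s, retrieved_docs, policy_vec, embed) for s in spans]
    return all(s < BCI_THRESHOLD for s in scores), scores
```

A provenance-aware decoder in the spirit of Prov-Decode would apply the same scorer one step earlier, vetoing candidate beam expansions whose predicted spans exceed the threshold rather than refusing after generation.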
Community
AlignGuard-LoRA: a principled framework that structurally decomposes LoRA fine-tuning updates into alignment-critical and task-specific components using Fisher information and geodesic constraints, preserving alignment with minimal utility loss.
➡️ Key Highlights of AlignGuard-LoRA:
🧪 Fisher-Guided Decomposition:
LoRA updates are orthogonally decomposed into alignment-critical (ΔW_A) and task-specific (ΔW_T) components via an eigendecomposition of the Fisher Information Matrix (FIM), enabling selective regularization along high-curvature, safety-sensitive directions (a minimal numerical sketch follows this list).
🧩 Collision-Aware Regularization (RO + Geo):
Introduces dual-mode regularization using local Riemannian overlap and global geodesic separation between ΔW_A and ΔW_T to prevent update interference and latent entanglement, ensuring structural disentanglement of safety and utility.
🧠 DRIFTCHECK & Forgetting Scaling Law:
Defines a new benchmark, DRIFTCHECK, for evaluating alignment drift, and validates a modified scaling law characterizing catastrophic forgetting in alignment-sensitive subspaces, showing up to 50% improvement in alignment retention without task degradation.
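As referenced in the decomposition item above, here is a minimal numerical sketch of one plausible reading of the Fisher-guided split: project the flattened LoRA update onto the top-k eigenvectors of the FIM to obtain ΔW_A, with the orthogonal remainder as ΔW_T. The function name, the toy random FIM, and the choice of k are assumptions for illustration, not AlignGuard-LoRA's code.

```python
# Illustrative Fisher-guided decomposition of a LoRA update into alignment-critical
# and task-specific parts; a plausible reading of the description above, not the
# paper's reference implementation.
import numpy as np


def fisher_guided_split(delta_w: np.ndarray, fim: np.ndarray, k: int):
    """Project a flattened update onto the span of the top-k Fisher eigenvectors
    (high-curvature, safety-sensitive directions) to obtain the alignment-critical
    component ΔW_A; the orthogonal remainder is the task-specific component ΔW_T."""
    eigvals, eigvecs = np.linalg.eigh(fim)            # FIM is symmetric PSD
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k curvature directions
    p_align = top @ top.T                             # orthogonal projector onto them
    delta_w_a = p_align @ delta_w                     # alignment-critical part
    delta_w_t = delta_w - delta_w_a                   # task-specific part
    return delta_w_a, delta_w_t


# Toy example: split an 8-dimensional update along the 2 highest-curvature directions.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 8))
fim = a @ a.T                                         # random symmetric PSD "Fisher"
dw_a, dw_t = fisher_guided_split(rng.normal(size=8), fim, k=2)
assert abs(dw_a @ dw_t) < 1e-8                        # components are orthogonal
```

The collision-aware regularizers described in the second item would then act on these two components separately, penalizing overlap between them rather than the raw update.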
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Probing the Robustness of Large Language Models Safety to Latent Perturbations (2025)
- AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning (2025)
- Automating Steering for Safe Multimodal Large Language Models (2025)
- QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA (2025)
- Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs (2025)
- MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning (2025)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs (2025)