arxiv:2508.02063

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Published on Aug 4 · Submitted by amanchadha on Aug 6

AI-generated summary

TraceAlign is a framework that identifies and mitigates alignment drift in LLMs by tracing unsafe completions to their training sources and applying interventions to reduce drift while maintaining utility.

Abstract

Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective penalizing high-BCI continuations during DPO, and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (utility delta below 0.2) and improving refusal quality. We further derive a theoretical upper bound on drift likelihood via suffix-array span statistics, linking memorization frequency and length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
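
Since the abstract only names the components, the following is a minimal, self-contained sketch of the general idea behind BCI-style scoring and a TraceShield-style filter, under simplifying assumptions: the suffix-array lookup is naive, the score here reflects only matched-span length and frequency (the paper's BCI additionally measures semantic conflict with the aligned policy), and every function name, formula, and threshold is hypothetical rather than taken from the released code.

```python
# Illustrative sketch (not the released implementation): suffix-array span matching,
# a toy BCI-style score, and a TraceShield-style refusal filter. Function names,
# the scoring formula, and the refusal threshold are assumptions for exposition.

from typing import List, Tuple


def build_suffix_array(corpus: str) -> List[int]:
    # Naive O(n^2 log n) construction; adequate for a small demo corpus.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def count_matches(probe: str, corpus: str, sa: List[int]) -> int:
    # Count suffixes that begin with `probe` via binary search on the suffix array.
    k = len(probe)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + k] < probe:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    while lo < len(sa) and corpus[sa[lo]:sa[lo] + k] == probe:
        lo += 1
    return lo - start


def longest_memorized_prefix(span: str, corpus: str, sa: List[int]) -> Tuple[int, int]:
    # Longest prefix of `span` that appears verbatim in the corpus, and its frequency.
    for k in range(len(span), 0, -1):
        freq = count_matches(span[:k], corpus, sa)
        if freq > 0:
            return k, freq
    return 0, 0


def belief_conflict_index(span: str, corpus: str, sa: List[int]) -> float:
    # Toy proxy for BCI: longer and more frequent memorized matches -> higher risk.
    # The paper's BCI additionally scores semantic conflict with the aligned policy.
    length, freq = longest_memorized_prefix(span, corpus, sa)
    if length == 0:
        return 0.0
    return (length / len(span)) * (freq / (freq + 1))


def trace_shield(spans: List[str], corpus: str, sa: List[int], threshold: float = 0.4) -> str:
    # TraceShield-style inference-time filter: refuse completions with any high-BCI span.
    if any(belief_conflict_index(s, corpus, sa) >= threshold for s in spans):
        return "Refused: completion traces to a conflicting training-time source."
    return " ".join(spans)


if __name__ == "__main__":
    training_corpus = "how to pick a lock with a tension wrench and a rake"
    sa = build_suffix_array(training_corpus)
    # A fully memorized span scores 1.0 * 1/2 = 0.5 here, above the 0.4 threshold.
    print(trace_shield(["pick a lock with a tension wrench"], training_corpus, sa))
```

Prov-Decode, as described in the abstract, would apply the same kind of scoring inside beam search and veto expansions predicted to yield high-BCI spans; that part is omitted from this sketch.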

Community

Paper author · Paper submitter

A principled framework that structurally decomposes LoRA fine-tuning updates into alignment-critical and task-specific components using Fisher Information and geodesic constraints, achieving alignment preservation with minimal utility loss.

โžก๏ธ ๐Š๐ž๐ฒ ๐‡๐ข๐ ๐ก๐ฅ๐ข๐ ๐ก๐ญ๐ฌ ๐จ๐Ÿ ๐€๐ฅ๐ข๐ ๐ง๐†๐ฎ๐š๐ซ๐-๐‹๐จ๐‘๐€:
๐Ÿงช ๐…๐ข๐ฌ๐ก๐ž๐ซ-๐†๐ฎ๐ข๐๐ž๐ ๐ƒ๐ž๐œ๐จ๐ฆ๐ฉ๐จ๐ฌ๐ข๐ญ๐ข๐จ๐ง:
LoRA updates are orthogonally decomposed into alignment-critical (ΔW_A) and task-specific (ΔW_T) components via an eigen-decomposed Fisher Information Matrix (FIM), enabling selective regularization along high-curvature safety-sensitive directions.

๐Ÿงฉ ๐‚๐จ๐ฅ๐ฅ๐ข๐ฌ๐ข๐จ๐ง-๐€๐ฐ๐š๐ซ๐ž ๐‘๐ž๐ ๐ฎ๐ฅ๐š๐ซ๐ข๐ณ๐š๐ญ๐ข๐จ๐ง (๐‘๐Œ + ๐†๐ž๐จ):
Introduces dual-mode regularization using local Riemannian overlap and global geodesic separation between ΔW_A and ΔW_T to prevent update interference and latent entanglement, ensuring structural disentanglement of safety and utility; an illustrative code sketch of this decomposition-plus-regularization idea follows after these highlights.

๐Ÿง  ๐ƒ๐‘๐ˆ๐…๐“๐‚๐‡๐„๐‚๐Š & ๐…๐จ๐ซ๐ ๐ž๐ญ๐ญ๐ข๐ง๐  ๐’๐œ๐š๐ฅ๐ข๐ง๐  ๐‹๐š๐ฐ:
Defines a new benchmark, DRIFTCHECK, for evaluating alignment drift and validates a modified scaling law that characterizes and reduces catastrophic forgetting in alignment-sensitive subspaces, showing up to 50% improvement in alignment retention without task degradation.
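
The highlights above describe the mechanics only at a high level, so here is a hedged, per-layer sketch of how a Fisher-guided split of a LoRA update and a collision-aware regularizer could fit together. Everything in it is an assumption for exposition: the function names (`fisher_guided_split`, `collision_aware_penalty`), the dense per-layer Fisher estimate, the top-k eigendirection cutoff, and the specific overlap and geodesic terms are not taken from the paper.

```python
# Hypothetical sketch (not the paper's implementation): split a per-layer LoRA
# update along high-curvature Fisher eigendirections, then penalize overlap and
# subspace entanglement between the alignment-critical and task-specific parts.

import torch


def fisher_guided_split(delta_w: torch.Tensor, fisher: torch.Tensor, top_k: int):
    """Orthogonally split dW (d_out x d_in) into an alignment-critical part (its
    projection onto the top-k Fisher eigendirections) and a task-specific remainder.
    `fisher` is a per-layer (d_out x d_out) approximation; the full FIM is intractable."""
    eigvals, eigvecs = torch.linalg.eigh(0.5 * (fisher + fisher.T))  # ascending order
    v_align = eigvecs[:, -top_k:]              # high-curvature, safety-sensitive directions
    p_align = v_align @ v_align.T              # orthogonal projector onto that subspace
    dw_align = p_align @ delta_w               # alignment-critical component (ΔW_A)
    dw_task = delta_w - dw_align               # task-specific component (ΔW_T)
    return dw_align, dw_task


def riemannian_overlap(dw_a, dw_t, fisher):
    """Local interference: normalized Fisher-weighted inner product of the two parts."""
    inner = torch.trace(dw_a.T @ fisher @ dw_t)
    norm = torch.sqrt(torch.trace(dw_a.T @ fisher @ dw_a)
                      * torch.trace(dw_t.T @ fisher @ dw_t) + 1e-12)
    return (inner / norm).abs()


def geodesic_entanglement(dw_a, dw_t):
    """Global entanglement: mean cosine of principal angles between the column
    spaces of the two components (0 = orthogonal subspaces, 1 = fully overlapping)."""
    q_a, _ = torch.linalg.qr(dw_a)
    q_t, _ = torch.linalg.qr(dw_t)
    return torch.linalg.svdvals(q_a.T @ q_t).mean()


def collision_aware_penalty(delta_w, fisher, top_k=8, alpha=1.0, beta=1.0):
    """Dual-mode regularizer added to the fine-tuning loss: discourage both local
    overlap and global subspace entanglement between ΔW_A and ΔW_T."""
    dw_a, dw_t = fisher_guided_split(delta_w, fisher, top_k)
    return alpha * riemannian_overlap(dw_a, dw_t, fisher) + beta * geodesic_entanglement(dw_a, dw_t)
```

In practice one would estimate `fisher` from squared gradients of an alignment loss on a small safety set and add `collision_aware_penalty` to the fine-tuning objective; whether AlignGuard-LoRA uses exactly these terms cannot be determined from the highlights alone.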
