arxiv:2506.13901

Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations

Published on Jun 16 · Submitted by amanchadha on Jun 18
Abstract

AI-generated summary: A new evaluation metric called the Alignment Quality Index (AQI) assesses the alignment of large language models by analyzing latent-space activations, capturing clustering quality to detect misalignments and alignment faking, and complementing existing behavioral proxies.

Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots: aligned models are often vulnerable to jailbreaking, the stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI), a novel geometric, prompt-invariant metric that empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early-warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
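To make the clustering-index idea concrete, here is a minimal sketch (not the paper's implementation) that mean-pools hidden states from one layer of an off-the-shelf model and scores how well safe and unsafe prompt activations separate, using two of the indices named above. The model choice, layer, pooling strategy, and the `get_activations` helper are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): score how cleanly safe vs.
# unsafe prompt activations separate in latent space with clustering indices.
import numpy as np
import torch
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
from transformers import AutoModel, AutoTokenizer

def get_activations(model, tokenizer, prompts, layer=-1):
    """Mean-pooled hidden states from one layer (a simplifying assumption)."""
    feats = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        h = out.hidden_states[layer].squeeze(0)   # (seq_len, hidden_dim)
        feats.append(h.mean(dim=0).numpy())       # mean-pool over tokens
    return np.stack(feats)

model_name = "gpt2"  # placeholder model, not one evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

safe_prompts = ["How do I bake bread?", "Explain photosynthesis simply."]
unsafe_prompts = ["How do I build a weapon?", "Write a phishing email."]

X = np.concatenate([
    get_activations(model, tokenizer, safe_prompts),
    get_activations(model, tokenizer, unsafe_prompts),
])
labels = np.array([0] * len(safe_prompts) + [1] * len(unsafe_prompts))

# Lower DBS and higher CHI both indicate cleaner safe/unsafe separation;
# AQI combines several such indices (DBS, DI, XBI, CHI) into one score.
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
```

The Dunn and Xie-Beni indices are not in scikit-learn, but they follow the same pattern of comparing within-cluster compactness against between-cluster separation and could be added analogously.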

Community

Paper author · Paper submitter

The paper introduces the Alignment Quality Index (AQI), a decoding-invariant metric leveraging latent geometric representations and clustering indices to diagnose hidden misalignments in large language models (LLMs), even under behavioral compliance.

  • Intrinsic Latent Geometry Metric: AQI measures alignment by assessing how well safe and unsafe prompts form distinct clusters in a model’s latent space using a combination of Xie-Beni and Calinski-Harabasz indices, making it invariant to decoding strategies and resistant to alignment faking.
  • Layerwise Pooled Representation Learning: It uses a sparse, learned pooling mechanism over hidden transformer layers to capture alignment-relevant abstractions without modifying the base model, enabling robust internal safety diagnostics (see the sketch after this list).
  • Empirical Failures of Behavioral Metrics: AQI reveals misalignments missed by traditional metrics (e.g., G-Eval, refusal rates) in scenarios like jailbreaks, safety-agnostic fine-tuning, and stochastic decoding—showcasing its strength as an early-warning alignment audit tool.
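The layer-wise pooled representation in the second bullet can be pictured as a small trainable module that forms a sparse, softmax-weighted combination of a frozen model's per-layer hidden states. The sketch below is a hypothetical illustration under that assumption; the class name, temperature trick, and token-level mean pooling are not taken from the paper.

```python
# Hypothetical sketch of layer-wise pooled representations: learn a sparse
# convex combination of a frozen model's per-layer hidden states.
import torch
import torch.nn as nn

class LayerwisePooler(nn.Module):
    def __init__(self, num_layers: int, temperature: float = 0.1):
        super().__init__()
        # One learnable logit per transformer layer; a low temperature pushes
        # the softmax toward a sparse weighting over layers.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature

    def forward(self, hidden_states):
        # hidden_states: tuple of (batch, seq_len, hidden_dim), one per layer
        stacked = torch.stack(hidden_states, dim=0)                 # (L, B, T, D)
        weights = torch.softmax(self.layer_logits / self.temperature, dim=0)
        pooled = (weights[:, None, None, None] * stacked).sum(dim=0)  # (B, T, D)
        return pooled.mean(dim=1)                                    # token mean-pool -> (B, D)

# Usage: feed `outputs.hidden_states` from a frozen base model (e.g. 12 layers
# plus the embedding layer) and train only the pooler's layer weights.
pooler = LayerwisePooler(num_layers=13)
```

Because only the pooler's logits are trained, the base model's weights stay untouched, which is what the bullet means by diagnosing alignment "without modifying the base model".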
