Abstract
Denoising Positional Encoding (DoPE) enhances length generalization in Transformer models by detecting and mitigating noisy frequency bands in positional embeddings, improving retrieval accuracy and reasoning stability.
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is available at https://The-physical-picture-of-LLMs.github.io
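As a rough illustration of the denoising idea in the abstract, here is a minimal NumPy sketch: it scores contiguous frequency bands of a per-head positional feature map with a truncated matrix entropy and replaces bands flagged as outliers with a parameter-free, zero-mean Gaussian. The band grouping, the threshold, the direction of the test (treating low-entropy bands as outliers), and the helper names (`dope_denoise`, `band_size`, `threshold`) are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of a DoPE-style denoising step (not the authors' code).
# Assumptions: bands are contiguous column groups, low truncated entropy marks
# an outlier band, and the Gaussian scale is matched to the band's std.
import numpy as np

def truncated_matrix_entropy(mat: np.ndarray, k: int) -> float:
    """Entropy of the top-k normalized singular values of `mat` (assumed definition)."""
    s = np.linalg.svd(mat, compute_uv=False)[:k]
    p = s / (s.sum() + 1e-12)          # normalize to a probability-like vector
    return float(-(p * np.log(p + 1e-12)).sum())

def dope_denoise(feature_map: np.ndarray, band_size: int = 8,
                 k: int = 16, threshold: float = 1.0) -> np.ndarray:
    """Score frequency bands (column groups) of a (seq_len, dim) positional
    feature map and reparameterize outlier bands with a zero-mean Gaussian."""
    out = feature_map.copy()
    dim = feature_map.shape[1]
    for start in range(0, dim, band_size):
        band = feature_map[:, start:start + band_size]
        if truncated_matrix_entropy(band, k) < threshold:
            # Outlier band: replace with Gaussian noise of matching scale.
            out[:, start:start + band_size] = np.random.normal(
                loc=0.0, scale=band.std() + 1e-12, size=band.shape)
    return out

# Toy usage: a random "positional feature map" for one attention head.
fm = np.random.randn(128, 64)
print(dope_denoise(fm).shape)  # (128, 64)
```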
Community
Good paper
Good comment!
Very insightful!
Insightful paper, everyone should read it!
Nice paper!
good paper!
Super formula heavy, but a good read. It's worth doing some literature review to understand how this differs from other developments in RoPE. The main idea that impressed me was the mitigation of the attention sink, shown first theoretically and then empirically.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities (2025)
- Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation (2025)
- Positional Preservation Embedding for Multimodal Large Language Models (2025)
- Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling (2025)
- Point-RTD: Replaced Token Denoising for Pretraining Transformer Models on Point Clouds (2025)
- Revisiting Multimodal Positional Encoding in Vision-Language Models (2025)
- From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs (2025)
Good user!