Abstract
Denoising Positional Encoding (DoPE) enhances length generalization in Transformer models by detecting and mitigating noisy frequency bands in positional embeddings, improving retrieval accuracy and reasoning stability.
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is available at https://The-physical-picture-of-LLMs.github.io
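As a rough illustration of the denoising idea in the abstract, here is a minimal NumPy sketch: it scores contiguous frequency bands of a per-head positional feature map with a truncated matrix entropy and replaces bands flagged as outliers with a parameter-free, zero-mean Gaussian. The band grouping, the threshold, the direction of the test (treating low-entropy bands as outliers), and the helper names (`dope_denoise`, `band_size`, `threshold`) are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of a DoPE-style denoising step (not the authors' code).
# Assumptions: bands are contiguous column groups, low truncated entropy marks
# an outlier band, and the Gaussian scale is matched to the band's std.
import numpy as np

def truncated_matrix_entropy(mat: np.ndarray, k: int) -> float:
    """Entropy of the top-k normalized singular values of `mat` (assumed definition)."""
    s = np.linalg.svd(mat, compute_uv=False)[:k]
    p = s / (s.sum() + 1e-12)          # normalize to a probability-like vector
    return float(-(p * np.log(p + 1e-12)).sum())

def dope_denoise(feature_map: np.ndarray, band_size: int = 8,
                 k: int = 16, threshold: float = 1.0) -> np.ndarray:
    """Score frequency bands (column groups) of a (seq_len, dim) positional
    feature map and reparameterize outlier bands with a zero-mean Gaussian."""
    out = feature_map.copy()
    dim = feature_map.shape[1]
    for start in range(0, dim, band_size):
        band = feature_map[:, start:start + band_size]
        if truncated_matrix_entropy(band, k) < threshold:
            # Outlier band: replace with Gaussian noise of matching scale.
            out[:, start:start + band_size] = np.random.normal(
                loc=0.0, scale=band.std() + 1e-12, size=band.shape)
    return out

# Toy usage: a random "positional feature map" for one attention head.
fm = np.random.randn(128, 64)
print(dope_denoise(fm).shape)  # (128, 64)
```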
Community
Good paper
Good comment!
Very insightful!
Insightful paper, everyone should read it!
Nice paper!
good paper!
Super formula heavy, but a good read. It's worth doing some literature review to understand how this differs from other developments in RoPE. The main idea that impressed me was the mitigation of the attention sink, shown first theoretically and then empirically.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities (2025)
- Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation (2025)
- Positional Preservation Embedding for Multimodal Large Language Models (2025)
- Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling (2025)
- Point-RTD: Replaced Token Denoising for Pretraining Transformer Models on Point Clouds (2025)
- Revisiting Multimodal Positional Encoding in Vision-Language Models (2025)
- From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs (2025)
Good user!