Reasoning with Exploration: An Entropy Perspective
Abstract
Introducing an entropy-based term to the advantage function in reinforcement learning enhances exploratory reasoning in language models, leading to improved performance on complex reasoning tasks.
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods, which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
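The abstract does not spell out the exact form of the entropy-based term; a minimal sketch of such an advantage shaping, assuming a hypothetical scale coefficient $\alpha$ and clipping coefficient $c$, is

$$\tilde{A}_t \;=\; A_t + \min\big(\alpha\,\mathcal{H}_t,\; c\,\lvert A_t\rvert\big),$$

where $A_t$ is the standard advantage for token $t$ and $\mathcal{H}_t$ is the gradient-detached entropy of the policy at that step. Keeping the bonus non-negative and bounded by a fraction of $\lvert A_t\rvert$ preserves the sign, and hence the direction, of the original policy update.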
Community
This work investigates reasoning with exploration to encourage longer and deeper reasoning chains in LMs, through the lens of entropy.
- We investigate and reveal a strong correlation between entropy and exploratory reasoning in LMs, showing that pivotal tokens, reflective actions, and rare behaviors emerge with higher entropy.
- We propose a minimal yet effective method that augments the standard RL advantage with a clipped, gradient-detached entropy term, encouraging exploration by fostering longer and deeper reasoning chains while preserving the original policy optimization direction (see the sketch after this list).
- We validate our approach on mainstream RL algorithms, GRPO and PPO, achieving substantial improvements on the Pass@K metric and pushing the boundaries of LM reasoning capabilities.
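As a concrete illustration, the sketch below implements the shaping described above in PyTorch. This is not the authors' code: the function name, hyperparameters (`alpha`, `clip_coef`), and the specific clipping rule (bounding the bonus by a fraction of the advantage magnitude) are assumptions chosen only to reflect the stated properties -- a clipped, gradient-detached entropy bonus added to the standard advantage without reversing the original optimization direction.

```python
import torch
import torch.nn.functional as F

def entropy_shaped_advantage(logits: torch.Tensor,
                             advantages: torch.Tensor,
                             alpha: float = 0.1,
                             clip_coef: float = 0.2) -> torch.Tensor:
    """Sketch of an entropy-augmented advantage (illustrative, not the paper's exact code).

    logits:     (batch, seq_len, vocab) policy logits for each generated token
    advantages: (batch, seq_len) standard advantages (e.g., from GRPO or PPO/GAE)
    """
    # Per-token entropy of the policy distribution, detached so the bonus
    # shapes the advantage without backpropagating through the entropy itself.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).detach()  # (batch, seq_len)

    # Clip the non-negative bonus to a fraction of |advantage| so it cannot
    # flip the sign of the original advantage (assumed clipping scheme).
    bonus = torch.minimum(alpha * entropy, clip_coef * advantages.abs())

    # The "one line" modification: shaped advantage = advantage + clipped bonus.
    return advantages + bonus
```

The shaped advantages would then be plugged into the usual GRPO or PPO surrogate loss in place of the original ones; because the bonus is detached and bounded, the gradient direction of the underlying policy update is unchanged.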