arxiv:2508.18884

HAEPO: History-Aggregated Exploratory Policy Optimization

Published on Aug 26

Authors:

Abstract

HAEPO, a history-aware exploratory loss, enhances exploration in reinforcement learning and large language models by aggregating trajectory log-likelihoods and using Plackett-Luce softmax, leading to robust learning behavior.

AI-generated summary

Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize the aggressive updates to prevent premature collapse and a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better or at par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.18884 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.18884 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.18884 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.