Abstract
FlowRL enhances LLM reinforcement learning by matching the full reward distribution through flow balancing, improving diversity and performance over reward-maximizing methods.
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
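For concreteness, the construction described in the abstract can be sketched as follows. This is written from the abstract's description plus standard GFlowNet-style trajectory balance; the temperature β and the exact surrogate form below are assumptions, not equations copied from the paper.

```latex
% Scalar rewards are mapped to a normalized target distribution using
% a learnable partition function Z_phi(x) (beta is an assumed temperature):
p_\phi(y \mid x) \;=\; \frac{\exp\!\big(\beta\, r(x,y)\big)}{Z_\phi(x)},
\qquad Z_\phi(x) \;\approx\; \sum_{y'} \exp\!\big(\beta\, r(x,y')\big)

% The policy is trained to match this target by minimizing the reverse KL:
\min_{\theta,\phi}\; \mathbb{E}_{x}\!\left[ D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\, p_\phi(\cdot \mid x)\big) \right]

% A flow-balance (trajectory-balance-style) surrogate with the same optimum
% penalizes the squared log-residual on sampled trajectories y:
\mathcal{L}_{\mathrm{TB}}(x,y) \;=\; \Big( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x,y) \Big)^{2}
```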
Community
- We propose FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching via flow balance, encouraging diverse reasoning path exploration while addressing the inherent mode-collapse limitations of existing RL methods.
- We introduce length normalization and importance sampling to enable effective training on variable-length CoT reasoning, addressing gradient explosion and sampling-mismatch issues (a code sketch of how these fit together follows this list).
- FlowRL outperforms GRPO and PPO by 10.0% and 5.1% respectively across math benchmarks and demonstrates strong generalization on code reasoning tasks, with diversity analysis confirming substantially more diverse solution exploration.
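The sketch below illustrates how the two pieces from the second bullet, length normalization and a clipped, detached importance weight, can plug into the flow-balance residual shown above. The function name, argument layout, and clipping range are illustrative assumptions, not the paper's implementation.

```python
import torch

def flow_balance_loss(logp_policy, logp_old, log_z, rewards, seq_lens,
                      beta=1.0, clip=0.2):
    """Sketch of a flow-balance loss with length normalization and
    importance sampling (illustrative; not the paper's exact code).

    logp_policy: (B,) sum of token log-probs under the current policy
    logp_old:    (B,) sum of token log-probs under the rollout policy
    log_z:       (B,) learnable log-partition estimates log Z_phi(x)
    rewards:     (B,) scalar rewards r(x, y)
    seq_lens:    (B,) response lengths |y|, used for length normalization
    """
    # Length normalization: average the per-token log-probs so long CoT
    # trajectories do not blow up the squared residual (and its gradient).
    logp_norm = logp_policy / seq_lens

    # Importance sampling: responses come from the old (rollout) policy,
    # so reweight by a clipped, detached ratio to correct the mismatch
    # without backpropagating through the weight itself.
    ratio = torch.exp(logp_policy - logp_old).detach()
    weight = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)

    # Flow-balance residual: log Z plus the (normalized) policy log-prob
    # should match the scaled reward beta * r at the optimum.
    residual = log_z + logp_norm - beta * rewards
    return (weight * residual.pow(2)).mean()
```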
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Revisiting LLM Reasoning via Information Bottleneck (2025)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (2025)
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- Geometric-Mean Policy Optimization (2025)
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
What to feed into MLP for Z_φ(x) in decoder-only architecture?
The partition function Z_φ(x) is implemented as a 3-layer MLP taking the prompt representation x as input. What exactly should be fed into the MLP for decoder-only models?
Options:
- Last prompt token - hidden state of the final token before generation starts
- Prompt pooling - mean/max pooling over all prompt token hidden states
- Separator token - add a special token between the prompt and the response
Which approach is most common for this use case?
Sorry for the confusion about log Z! I'll detail this for you and update our paper ASAP.
From the flow perspective: log Z measures the probability flow from the initial state S_0. Intuitively, it estimates a denominator, the sum of rewards across all possible paths, so that a scalar reward can be converted into a distribution via reward / Z.
From the implementation perspective: since the prompt is the initial state, we use the prompt encoded by the LM's last-layer hidden states. To convert a variable-length prompt into a fixed-size input for the MLP, we empirically take the mean over the prompt token states. There are definitely other approaches here that we haven't explored yet.
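A minimal sketch of that setup (mean-pooled last-layer prompt states → 3-layer MLP → scalar log Z_φ(x)); the module name, hidden sizes, activation, and masking details are illustrative assumptions rather than the released code:

```python
import torch
import torch.nn as nn

class LogZHead(nn.Module):
    """Illustrative log Z_phi(x) head: a 3-layer MLP over the mean-pooled
    prompt representation (names/sizes are assumptions, not the paper's code)."""

    def __init__(self, hidden_size: int, mlp_size: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_size),
            nn.GELU(),
            nn.Linear(mlp_size, mlp_size),
            nn.GELU(),
            nn.Linear(mlp_size, 1),  # scalar log Z_phi(x)
        )

    def forward(self, hidden_states: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H) last-layer states from the decoder-only LM
        # prompt_mask:   (B, T) 1 for prompt tokens, 0 for padding/response tokens
        mask = prompt_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.mlp(pooled).squeeze(-1)  # (B,) one log Z_phi(x) per prompt
```

Whether the backbone is frozen when computing the pooled prompt state, and which pooling (mean vs. last token) works best, are further design choices the reply above leaves open.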