arxiv:2509.15207

FlowRL: Matching Reward Distributions for LLM Reasoning

Published on Sep 18
Submitted by Daixuan Cheng on Sep 19
#2 Paper of the day
Abstract

AI-generated summary: FlowRL enhances LLM reinforcement learning by matching the full reward distribution through flow balancing, improving diversity and performance over reward-maximizing methods.

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
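To make the abstract's objective concrete, one way to write out the target distribution and the reverse-KL objective it describes is shown below. The scaling factor β and the exact normalization are assumptions for illustration, not the paper's notation.

```latex
% Target distribution induced by scalar rewards, with a learnable
% log-partition estimate Z_phi(x); beta is an assumed reward scaling.
p_\phi(y \mid x) \;=\; \frac{\exp\!\big(\beta\, r(x, y)\big)}{Z_\phi(x)}

% Reverse KL between the policy and this target, which FlowRL minimizes:
D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x) \,\|\, p_\phi(\cdot \mid x)\big)
  = \mathbb{E}_{y \sim \pi_\theta}\!\big[\log \pi_\theta(y \mid x) - \beta\, r(x, y)\big] + \log Z_\phi(x)
```

Because log Z_φ(x) enters the objective additively, it can be estimated jointly with the policy, which is where the flow-balance view comes in.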

Community

Paper author and submitter:
  • We propose FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching via flow balance, encouraging diverse reasoning path exploration while addressing the inherent mode-collapse limitations of existing RL methods.
  • We introduce length normalization and importance sampling to enable effective training on variable-length CoT reasoning, addressing gradient explosion and sampling-mismatch issues (a minimal sketch follows this list).
  • FlowRL outperforms GRPO and PPO by 10.0% and 5.1% respectively across math benchmarks and demonstrates strong generalization on code reasoning tasks, with diversity analysis confirming substantially more diverse solution exploration.
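As a rough illustration of the second bullet, the sketch below combines a length-normalized sequence log-probability, clipped importance weights, and a squared flow-balance residual. It is a minimal sketch under assumptions (tensor layout, the clipping value, the names `beta` and `flowrl_loss`), not the paper's exact implementation.

```python
# Minimal sketch of a flow-balance training objective with length
# normalization and importance sampling. Shapes and constants are assumptions.
import torch

def flowrl_loss(token_logp, old_token_logp, reward, log_z, mask, beta=1.0):
    """
    token_logp     : (B, T) per-token log-probs under the current policy
    old_token_logp : (B, T) per-token log-probs under the sampling (behavior) policy
    reward         : (B,)   scalar reward per sampled response
    log_z          : (B,)   learnable log-partition estimate log Z_phi(x)
    mask           : (B, T) 1.0 for response tokens, 0.0 for padding
    """
    lengths = mask.sum(dim=-1).clamp(min=1)

    # Length normalization: average (rather than sum) token log-probs so long
    # chain-of-thought responses do not dominate the gradient.
    logp = (token_logp * mask).sum(dim=-1) / lengths

    # Importance sampling: correct the mismatch between the policy that
    # generated the rollouts and the policy being updated; weights are
    # treated as constants (no gradient), and the clip value is an assumption.
    with torch.no_grad():
        log_ratio = ((token_logp - old_token_logp) * mask).sum(dim=-1) / lengths
        weights = torch.exp(log_ratio).clamp(max=10.0)

    # Flow balance residual: log Z_phi(x) + log pi_theta(y|x) should match beta * r(x, y).
    residual = log_z + logp - beta * reward
    return (weights * residual ** 2).mean()
```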


What to feed into the MLP for Z_φ(x) in a decoder-only architecture?

The partition function Z_φ(x) is implemented as a 3-layer MLP that takes the prompt representation x as input. What exactly should be fed into the MLP for decoder-only models?

Options:

  1. Last prompt token - hidden state of the final token before generation starts
  2. Prompt pooling - mean/max pooling over all prompt token hidden states
  3. Separator token - add special token between prompt and response

Which approach is most common for this use case?

Paper author

Sorry for the confusing part about log Z! I'll detail it here and update our paper ASAP.

From the flow perspective: log Z measures the probability flow out of the initial state S_0. Intuitively, it estimates a normalizing denominator: the total reward summed over all possible paths, so that rewards can be converted into a distribution via reward/Z.

From the implementation perspective: since it corresponds to the initial state, we use the prompt encoded by the LM's last-layer hidden states. To handle variable-length prompts, we empirically take the mean over the prompt token states, and the MLP then maps this pooled vector to a scalar. There are definitely other approaches here that we haven't explored yet.
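For concreteness, here is a minimal sketch of the head described above: mean-pool the last-layer hidden states over the prompt tokens, then map the pooled vector to a scalar log Z_φ(x) with a 3-layer MLP. The hidden size, activation, and layer widths are assumptions.

```python
# Sketch of a log Z_phi(x) head: mean-pooled prompt states -> 3-layer MLP -> scalar.
import torch
import torch.nn as nn

class LogZHead(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        # 3-layer MLP producing a scalar estimate of log Z_phi(x).
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H) last-layer states from the decoder-only LM
        # prompt_mask  : (B, T) 1 for prompt tokens, 0 elsewhere
        mask = prompt_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.mlp(pooled).squeeze(-1)  # (B,) estimate of log Z_phi(x)
```

Of the options in the question above, this corresponds to prompt pooling (option 2), per the author's answer.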

