Abstract
FlowRL enhances LLM reinforcement learning by matching the full reward distribution through flow balancing, improving diversity and performance over reward-maximizing methods.
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
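For concreteness, the construction described in the abstract can be sketched as follows. This is written from the abstract's description plus standard GFlowNet-style trajectory balance; the temperature β and the exact surrogate form below are assumptions, not equations copied from the paper.

```latex
% Scalar rewards are mapped to a normalized target distribution using
% a learnable partition function Z_phi(x) (beta is an assumed temperature):
p_\phi(y \mid x) \;=\; \frac{\exp\!\big(\beta\, r(x,y)\big)}{Z_\phi(x)},
\qquad Z_\phi(x) \;\approx\; \sum_{y'} \exp\!\big(\beta\, r(x,y')\big)

% The policy is trained to match this target by minimizing the reverse KL:
\min_{\theta,\phi}\; \mathbb{E}_{x}\!\left[ D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\, p_\phi(\cdot \mid x)\big) \right]

% A flow-balance (trajectory-balance-style) surrogate with the same optimum
% penalizes the squared log-residual on sampled trajectories y:
\mathcal{L}_{\mathrm{TB}}(x,y) \;=\; \Big( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x,y) \Big)^{2}
```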
Community
- We propose FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching via flow balance, encouraging diverse reasoning path exploration while addressing the inherent mode-collapse limitations of existing RL methods.
- We introduce length normalization and importance sampling to enable effective training on variable-length CoT reasoning, addressing gradient explosion and sampling-mismatch issues (a code sketch of how these fit together follows this list).
- FlowRL outperforms GRPO and PPO by 10.0% and 5.1% respectively across math benchmarks and demonstrates strong generalization on code reasoning tasks, with diversity analysis confirming substantially more diverse solution exploration.
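The sketch below illustrates how the two pieces from the second bullet, length normalization and a clipped, detached importance weight, can plug into the flow-balance residual shown above. The function name, argument layout, and clipping range are illustrative assumptions, not the paper's implementation.

```python
import torch

def flow_balance_loss(logp_policy, logp_old, log_z, rewards, seq_lens,
                      beta=1.0, clip=0.2):
    """Sketch of a flow-balance loss with length normalization and
    importance sampling (illustrative; not the paper's exact code).

    logp_policy: (B,) sum of token log-probs under the current policy
    logp_old:    (B,) sum of token log-probs under the rollout policy
    log_z:       (B,) learnable log-partition estimates log Z_phi(x)
    rewards:     (B,) scalar rewards r(x, y)
    seq_lens:    (B,) response lengths |y|, used for length normalization
    """
    # Length normalization: average the per-token log-probs so long CoT
    # trajectories do not blow up the squared residual (and its gradient).
    logp_norm = logp_policy / seq_lens

    # Importance sampling: responses come from the old (rollout) policy,
    # so reweight by a clipped, detached ratio to correct the mismatch
    # without backpropagating through the weight itself.
    ratio = torch.exp(logp_policy - logp_old).detach()
    weight = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)

    # Flow-balance residual: log Z plus the (normalized) policy log-prob
    # should match the scaled reward beta * r at the optimum.
    residual = log_z + logp_norm - beta * rewards
    return (weight * residual.pow(2)).mean()
```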
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Revisiting LLM Reasoning via Information Bottleneck (2025)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (2025)
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- Geometric-Mean Policy Optimization (2025)
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
What to feed into MLP for Z_φ(x) in decoder-only architecture?
The partition function Z_φ(x) is implemented as a 3-layer MLP taking the prompt representation x as input. What exactly should be fed into the MLP for decoder-only models?
Options:
- Last prompt token - hidden state of the final token before generation starts
- Prompt pooling - mean/max pooling over all prompt token hidden states
- Separator token - add a special token between the prompt and the response
Which approach is most common for this use case?
Sorry for the confusion about log Z! I'll detail this for you and update our paper ASAP.
From the flow perspective: log Z measures the probability flow from the initial state S_0. Intuitively, it estimates a denominator, the sum of rewards across all possible paths, so that a scalar reward can be converted into a distribution via reward / Z.
From the implementation perspective: since the prompt is the initial state, we use the prompt encoded by the LM's last-layer hidden states. To convert a variable-length prompt into a fixed-size input for the MLP, we empirically take the mean over the prompt token states. There are definitely other approaches here that we haven't explored yet.
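A minimal sketch of that setup (mean-pooled last-layer prompt states → 3-layer MLP → scalar log Z_φ(x)); the module name, hidden sizes, activation, and masking details are illustrative assumptions rather than the released code:

```python
import torch
import torch.nn as nn

class LogZHead(nn.Module):
    """Illustrative log Z_phi(x) head: a 3-layer MLP over the mean-pooled
    prompt representation (names/sizes are assumptions, not the paper's code)."""

    def __init__(self, hidden_size: int, mlp_size: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_size),
            nn.GELU(),
            nn.Linear(mlp_size, mlp_size),
            nn.GELU(),
            nn.Linear(mlp_size, 1),  # scalar log Z_phi(x)
        )

    def forward(self, hidden_states: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H) last-layer states from the decoder-only LM
        # prompt_mask:   (B, T) 1 for prompt tokens, 0 for padding/response tokens
        mask = prompt_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.mlp(pooled).squeeze(-1)  # (B,) one log Z_phi(x) per prompt
```

Whether the backbone is frozen when computing the pooled prompt state, and which pooling (mean vs. last token) works best, are further design choices the reply above leaves open.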