Improve model card: Add detailed framework and results sections (#2)
Improve model card: Add detailed framework and results sections (184401420a7eb6095c5d50a1f609cb9ed46b9171)
Co-authored-by: Niels Rogge <[email protected]>
README.md

---
base_model: Qwen/Qwen2.5-3B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- verl
paper: https://huggingface.co/papers/2507.14295
---

# Qwen2.5-3B-UFO

This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning, as presented in the paper [A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning](https://huggingface.co/papers/2507.14295).

GitHub: https://github.com/lichengliu03/unary-feedback

Website: https://unary-feedback.github.io/

## Overview

**"Let's Try Again"** addresses a critical gap in language model training: while single-turn reinforcement learning (RL) improves reasoning, models trained this way fail in **multi-turn interactive scenarios**, often repeating the same wrong answer despite feedback.

### Key Problem

Single-turn RL models lose the ability to revise their reasoning across multiple turns. In 70% of failure cases, they produce identical answers across 5 interaction rounds, unable to incorporate feedback as simple as "try again."

### Solution: UFO Framework

**Unary Feedback as Observation (UFO)** turns static datasets into multi-turn training by:
- Using only minimal feedback signals ("Try Again")
- Treating failure feedback as part of the observation
- Enabling models to learn from their history of mistakes

### Results

- **14% improvement** in multi-turn success rate
- **10% reduction** in average interaction turns
- Better performance even in single-turn settings
- **90% non-repetitive answers** (vs. 80% for the baseline)

### Impact

UFO enables effective multi-turn RL training on existing static datasets without expensive annotations, making it practical to train models that learn from sparse feedback and improve iteratively through trial and error, much like humans do.

## Model Info

- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Clip Ratio**: 0.2-0.28
- **Temperature**: 1.0 (train), 0.5 (eval)

## UFO Framework Details

The UFO framework turns static single-turn datasets into multi-turn interactive training through a simple yet effective approach.

<p align="center"><img src="public/fig1.png" width="800px" alt="UFO Framework Flow" /></p>
<p align="center" style="font-size: 16px; max-width: 800px; margin: 0 auto;">
The UFO framework flow: static datasets are transformed into multi-turn episodes in which the model receives minimal feedback ("Try Again") and learns to revise its reasoning across attempts.
</p>

### Problem Formulation

We model multi-turn problem solving as a finite-horizon Markov Decision Process (MDP) where:
- **State**: encodes the original question and the history of past attempts with their feedback
- **Action**: any answer the model can generate
- **Reward**: a binary signal (1 for correct, 0 for incorrect)
- **Transition**: the agent generates an answer and receives feedback; the episode continues until success or the turn limit is reached
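
A minimal sketch of this episode loop is shown below. It is illustrative only, not the released environment code: the `policy` and `is_correct` callables and the 5-turn cap are placeholders.

```python
# Illustrative episode loop for the MDP above (not the released environment).
# `policy` maps (question, history) to an answer string; `is_correct` stands in
# for the dataset's answer checker; MAX_TURNS stands in for the turn limit.
MAX_TURNS = 5
FEEDBACK = "Try Again."

def run_episode(policy, question, is_correct):
    history = []  # past (attempt, feedback) pairs observed by the model
    for turn in range(MAX_TURNS):
        answer = policy(question, history)
        if is_correct(answer):
            return 1.0, turn + 1      # binary reward, number of turns used
        history.append((answer, FEEDBACK))
    return 0.0, MAX_TURNS             # episode ends unsuccessfully at the cap
```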

### Unary Feedback as Observation (UFO)

The core innovation is treating minimal feedback as part of the observation:

```
Question: What is the value of x + y?
Attempt 1: [wrong answer]
Feedback: Try Again.
Attempt 2: [correct answer]
```

**Key Features:**
- Only **negative feedback** (e.g., "Try Again") is included in the context
- No positive confirmation signals are ever shown
- The model must learn to revise based solely on its failure history
- Episodes terminate immediately upon a correct answer
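
For concreteness, the observation at each turn can be assembled from the question and the failure history roughly as follows; the exact prompt template is an assumption, not the one used in training.

```python
# Sketch of assembling a UFO-style observation (illustrative; the exact
# prompt wording used in training is not reproduced here).
def build_observation(question: str, failed_attempts: list[str]) -> str:
    lines = [f"Question: {question}"]
    for i, attempt in enumerate(failed_attempts, start=1):
        lines.append(f"Attempt {i}: {attempt}")
        lines.append("Feedback: Try Again.")  # only unary, negative feedback
    lines.append(f"Attempt {len(failed_attempts) + 1}:")
    return "\n".join(lines)
```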

### Training with PPO

We use Proximal Policy Optimization (PPO) to train the policy:
- The agent observes the input together with the full interaction history
- It generates an answer and receives a binary reward
- Policy updates use the clipped surrogate objective
- The value function provides advantage estimates for stable training
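
The clipped surrogate objective can be written compactly as below. This is a generic PPO sketch (with a clip ratio matching the 0.2 setting in Model Info), not the verl implementation used for training.

```python
import torch

# Generic PPO clipped-surrogate loss (illustrative; not the verl training code).
# `logp` / `logp_old` are log-probabilities of the taken actions under the
# current and rollout policies; `advantages` come from the value function.
def ppo_clip_loss(logp: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_ratio: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()
```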

### Reward Design

Two complementary strategies encourage efficient reasoning:

**1. Exponential Reward Decay:**
```
DecayReward(t) = γ^t if correct, 0 otherwise
```
Favors solving problems in fewer turns.

**2. Repetition Penalty:**
```
Penalty(τ) = λ · (1 - E(τ)/T)
```
Penalizes duplicate answers, encouraging diverse reasoning strategies.
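
As a rough illustration, the two shaping terms might be computed as follows. The values of γ and λ are examples, and reading E(τ) as the number of distinct answers among the T turns of trajectory τ is an assumption based on the description above, not the exact training configuration.

```python
# Illustrative sketch of the reward shaping above (not the released training
# code). gamma and lam are example values; E(tau) is read as the number of
# distinct answers in the trajectory, which is an assumption.
def decay_reward(turns_used: int, correct: bool, gamma: float = 0.9) -> float:
    """gamma**t if the final answer is correct, 0 otherwise."""
    return gamma ** turns_used if correct else 0.0

def repetition_penalty(answers: list[str], lam: float = 0.1) -> float:
    """lam * (1 - E(tau)/T): zero when all answers are distinct."""
    T = len(answers)
    if T == 0:
        return 0.0
    E = len(set(answers))  # distinct answers produced in the trajectory
    return lam * (1 - E / T)
```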

This framework enables effective multi-turn RL training on static datasets without requiring expensive annotations or complex environments.

## Usage

```python
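# Minimal usage sketch (illustrative). The model id below is an assumption;
# replace it with this repository's id if it differs. Generation settings are
# examples (temperature 0.5 follows the eval setting in Model Info).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lichengliu03/Qwen2.5-3B-UFO"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 15% of 240?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.5)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```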

- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks

## Key Results

### Multi-Turn Reasoning Performance

We compare our multi-turn UFO model against a strong single-turn PPO baseline. For a fair comparison, the baseline is evaluated with 5 independent samples (Pass@5), while our model makes 5 sequential attempts with feedback (Succ@5); success is recorded if any of the 5 responses is correct. We also analyze the impact of varying the maximum number of interaction turns during training.
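
The two evaluation protocols can be sketched as follows; `generate` and `is_correct` are placeholder callables for illustration, not the actual evaluation harness.

```python
# Illustrative comparison of the two evaluation protocols described above.
# `generate(question, history)` and `is_correct(answer)` are placeholders.
def pass_at_5(generate, question, is_correct):
    # Baseline: 5 independent samples, no feedback between them.
    return any(is_correct(generate(question, history=[])) for _ in range(5))

def succ_at_5(generate, question, is_correct):
    # UFO: 5 sequential attempts; each failure and the "Try Again." feedback
    # are appended to the context before the next attempt.
    history = []
    for _ in range(5):
        answer = generate(question, history)
        if is_correct(answer):
            return True
        history.append((answer, "Try Again."))
    return False
```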

<p align="center">
  <img src="public/compare_baseline.png" width="46.2%" alt="UFO Performance Comparison" />
  <img src="public/multi-turn_training.png" width="45%" alt="Multi-turn Training Process" />
</p>
<p align="center" style="font-size: 14px; color: #666;">
Left: multi-turn (5-turn) RL significantly outperforms the single-turn baseline. Right: performance with different numbers of training turns (1, 5, and 10).
</p>

**Key Findings:**
- **+14% success rate** over the single-turn PPO baseline
- Benefits generalize to both multi-turn and single-turn inference
- Best results with 5-turn training; more turns yield diminishing returns

### Effectiveness of Unary Feedback

To further investigate the role of unary feedback, we compare model performance under different feedback-availability conditions: in scenario (a), unary feedback is provided during both training and validation, while in scenario (b) it is available only during training. Access to unary feedback in both phases substantially improves the validation success rate, whereas providing it only during training yields no improvement, indicating that the benefit of unary feedback depends on its availability at inference time.

<p align="center"><img src="public/feedback_comparisons_side_by_side.png" width="80%" alt="Effectiveness of Unary Feedback" /></p>
<p align="center" style="font-size: 14px; color: #666;">
Success rate comparison under different unary feedback settings: (a) feedback in both training and validation; (b) feedback only in training.
</p>

**Key Insights:**
- Feedback in both training and validation is crucial for improvement
- Feedback only during training does **not** help at inference

### Reward Design Impact

**Exponential Reward Decay:**
- Reduces the average number of actions needed to solve a problem by ~10%
- Encourages faster, more efficient problem solving

**Answer Diversity:**
- The non-repetitive answer ratio increases from 79.7% to 92.8%
- Multi-turn RL with UFO encourages answer diversity and strengthens robustness

## License

This model is licensed under Apache 2.0.

## Acknowledgements

We thank the [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1) team for providing the DeepSeek-R1 model and early conceptual inspiration. We are grateful to the [veRL](https://github.com/volcengine/verl) team for their infrastructure support and to the [RAGEN](https://github.com/RAGEN-AI/RAGEN) team for their multi-turn RL framework.