LichengLiu03 nielsr HF Staff committed on
Commit
555fe4c
verified
1 Parent(s): 97eaad3

Improve model card: Add detailed framework and results sections (#2)


- Improve model card: Add detailed framework and results sections (184401420a7eb6095c5d50a1f609cb9ed46b9171)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +130 -3
README.md CHANGED
@@ -1,5 +1,6 @@
1
  ---
2
  base_model: Qwen/Qwen2.5-3B-Instruct
 
3
  license: apache-2.0
4
  pipeline_tag: text-generation
5
  tags:
@@ -11,7 +12,6 @@ tags:
11
  - reasoning
12
  - verl
13
  paper: https://huggingface.co/papers/2507.14295
14
- library_name: transformers
15
  ---
16
 
17
  # Qwen2.5-3B-UFO
@@ -19,9 +19,30 @@ library_name: transformers
19
  This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning, as presented in the paper [A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning](https://huggingface.co/papers/2507.14295).
20
 
21
  Github: https://github.com/lichengliu03/unary-feedback
22
-
23
  Website: https://unary-feedback.github.io/
24
 
25
  ## Model Info
26
 
27
  - **Base model**: Qwen/Qwen2.5-3B-Instruct
@@ -42,6 +63,66 @@ Website: https://unary-feedback.github.io/
42
  - **Clip Ratio**: 0.2-0.28
43
  - **Temperature**: 1.0 (train), 0.5 (eval)
44
 
45
  ## Usage
46
 
47
  ```python
@@ -96,6 +177,52 @@ This model is optimized for mathematical reasoning with PPO, and compared to the
96
  - May not perform as well on general tasks
97
  - Recommended for math, logic, and reasoning tasks
98
 
99
  ## License
100
 
101
- This model is licensed under Apache 2.0.
 
1
  ---
2
  base_model: Qwen/Qwen2.5-3B-Instruct
3
+ library_name: transformers
4
  license: apache-2.0
5
  pipeline_tag: text-generation
6
  tags:
 
12
  - reasoning
13
  - verl
14
  paper: https://huggingface.co/papers/2507.14295
 
15
  ---
16
 
17
  # Qwen2.5-3B-UFO
 
19
  This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning, as presented in the paper [A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning](https://huggingface.co/papers/2507.14295).
20
 
21
  Github: https://github.com/lichengliu03/unary-feedback
 
22
  Website: https://unary-feedback.github.io/
23
 
24
+ ## Overview
25
+
26
+ **"Let's Try Again"** addresses a critical gap in language model training: while single-turn reinforcement learning (RL) improves reasoning, models trained this way fail in **multi-turn interactive scenarios**, often repeating the same wrong answers despite feedback.
27
+
28
+ ### Key Problem
29
+ Single-turn RL models lose the ability to revise reasoning across multiple turns. In 70% of failure cases, they produce identical answers across 5 interaction rounds, unable to incorporate simple feedback like "try again."
30
+
31
+ ### Solution: UFO Framework
32
+ **Unary Feedback as Observation (UFO)** transforms static datasets into multi-turn training by:
33
+ - Using only minimal feedback signals ("Try Again")
34
+ - Treating failure feedback as part of the observation
35
+ - Enabling models to learn from historical mistakes
36
+
37
+ ### Results
38
+ - **14% improvement** in multi-turn success rates
39
+ - **10% reduction** in average interaction turns
40
+ - Better performance even in single-turn scenarios
41
+ - **90% non-repetitive answers** (vs 80% baseline)
42
+
43
+ ### Impact
44
+ UFO enables effective multi-turn RL training on existing static datasets without expensive annotations, making it practical to train models that can learn from sparse feedback and improve iteratively through trial-and-error, just like humans do.
45
+
46
  ## Model Info
47
 
48
  - **Base model**: Qwen/Qwen2.5-3B-Instruct
 
63
  - **Clip Ratio**: 0.2-0.28
64
  - **Temperature**: 1.0 (train), 0.5 (eval)
65
 
66
+ ## UFO Framework Details
67
+
68
+ The UFO framework transforms static single-turn datasets into multi-turn interactive training through a simple yet effective approach.
69
+
70
+ <p align="center"><img src="public/fig1.png" width="800px" alt="UFO Framework Flow" /></p>
71
+ <p align="center" style="font-size: 16px; max-width: 800px; margin: 0 auto;">
72
+ The UFO framework flow: Static datasets are transformed into multi-turn episodes where models receive minimal feedback ("Try Again") and learn to revise their reasoning across multiple attempts.
73
+ </p>
74
+
75
+ ### Problem Formulation
76
+
77
+ We model multi-turn problem solving as a finite-horizon Markov Decision Process (MDP) where:
78
+ - **State**: Encodes the original question and history of past attempts with feedback
79
+ - **Action**: All possible answers the model can generate
80
+ - **Reward**: Binary signal (1 for correct, 0 for incorrect)
81
+ - **Transition**: Agent generates answer, receives feedback, episode continues until success or max turns
82
+
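+ Below is a minimal, illustrative sketch of one episode under this formulation. It is not the released training code: `generate_answer` and `is_correct` are hypothetical stand-ins for the policy model and the dataset's answer checker, and the turn limit of 5 simply mirrors the evaluation setting described later.
+
+ ```python
+ # Illustrative UFO-style episode loop (assumptions: generate_answer and
+ # is_correct are placeholders for the policy model and the answer checker).
+ def run_episode(question, generate_answer, is_correct, max_turns=5):
+     history = []  # past (answer, feedback) pairs; with the question, this forms the state
+     for turn in range(max_turns):
+         state = {"question": question, "history": list(history)}
+         answer = generate_answer(state)              # action: a candidate answer
+         if is_correct(question, answer):
+             return {"reward": 1, "turns": turn + 1}  # binary reward; episode ends on success
+         history.append((answer, "Try Again."))       # unary feedback becomes part of the observation
+     return {"reward": 0, "turns": max_turns}         # turn limit reached without success
+ ```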
83
+ ### Unary Feedback as Observation (UFO)
84
+
85
+ The core innovation is treating minimal feedback as part of the observation:
86
+
87
+ ```
88
+ Question: What is the value of x + y?
89
+ Attempt 1: [wrong answer]
90
+ Feedback: Try Again.
91
+ Attempt 2: [correct answer]
92
+ ```
93
+
94
+ **Key Features:**
95
+ - Only **negative feedback** (e.g., "Try Again") is included in context
96
+ - No positive confirmation signals are ever shown
97
+ - Model must learn to revise based solely on failure history
98
+ - Episodes terminate immediately upon correct answer
99
+
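+ One way such an observation could be serialized into a prompt is sketched below; the exact template used for training is not shown in this card, so the formatting here is only an assumption that follows the example above.
+
+ ```python
+ # Hypothetical prompt builder: only the question and failed attempts with
+ # "Try Again" feedback are included; no positive confirmation is ever shown.
+ def build_observation(question: str, failed_attempts: list[str]) -> str:
+     lines = [f"Question: {question}"]
+     for i, attempt in enumerate(failed_attempts, start=1):
+         lines.append(f"Attempt {i}: {attempt}")
+         lines.append("Feedback: Try Again.")
+     lines.append(f"Attempt {len(failed_attempts) + 1}:")  # the model continues from here
+     return "\n".join(lines)
+
+ print(build_observation("What is the value of x + y?", ["x + y = 7"]))
+ ```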
100
+ ### Training with PPO
101
+
102
+ We use Proximal Policy Optimization (PPO) to train the policy:
103
+ - Agent observes input with full interaction history
104
+ - Generates answer and receives binary reward
105
+ - Policy updates using clipped surrogate objective
106
+ - Value function provides advantage estimates for stable training
107
+
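+ For reference, the standard PPO clipped surrogate objective looks like the PyTorch sketch below. This is the generic textbook form, not the exact veRL implementation; the 0.2 default matches the lower end of the clip-ratio range listed in the training configuration.
+
+ ```python
+ import torch
+
+ def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
+     """Generic PPO clipped surrogate loss (returned as a quantity to minimize)."""
+     ratio = torch.exp(logp_new - logp_old)                        # pi_new / pi_old
+     unclipped = ratio * advantages
+     clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
+     return -torch.min(unclipped, clipped).mean()                  # maximize surrogate = minimize its negative
+ ```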
108
+ ### Reward Design
109
+
110
+ Two complementary strategies encourage efficient reasoning:
111
+
112
+ **1. Exponential Reward Decay:**
113
+ ```
114
+ DecayReward(t) = γ^t if correct, 0 otherwise
115
+ ```
116
+ Favors solving problems in fewer turns.
117
+
118
+ **2. Repetition Penalty:**
119
+ ```
120
+ Penalty(τ) = λ · (1 - E(τ)/T)
121
+ ```
122
+ Penalizes duplicate answers, encouraging diverse reasoning strategies.
123
+
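+ A minimal sketch of how these two terms could be computed for a finished trajectory is shown below, under the assumption that E(τ) counts the distinct answers in the trajectory and T is the number of turns taken; γ and λ are hyperparameters whose values are not specified in this card.
+
+ ```python
+ # Illustrative reward shaping (assumptions: E(tau) = number of distinct answers,
+ # T = number of turns; gamma and lam are unspecified hyperparameters).
+ def decay_reward(correct: bool, turn: int, gamma: float = 0.9) -> float:
+     """gamma**t if the answer at turn t is correct, 0 otherwise."""
+     return gamma ** turn if correct else 0.0
+
+ def repetition_penalty(answers: list[str], lam: float = 0.1) -> float:
+     """lam * (1 - E(tau)/T): zero when every attempt is distinct, larger with more repeats."""
+     T = len(answers)
+     E = len(set(answers))
+     return lam * (1 - E / T) if T > 0 else 0.0
+ ```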
124
+ This framework enables effective multi-turn RL training on static datasets without requiring expensive annotations or complex environments.
125
+
126
  ## Usage
127
 
128
  ```python
 
177
  - May not perform as well on general tasks
178
  - Recommended for math, logic, and reasoning tasks
179
 
180
+ ## Key Results
181
+
182
+ ### Multi-Turn Reasoning Performance
183
+
184
+ We compare our multi-turn UFO model against a strong single-turn PPO baseline. For a fair comparison, the baseline is evaluated on 5 independent samples (Pass@5), while our model uses 5 sequential attempts with feedback (Succ@5). In both cases, success is recorded if any of the 5 responses is correct. We also analyze the impact of varying the maximum number of interaction turns during training.
185
+
186
+ <p align="center">
187
+ <img src="public/compare_baseline.png" width="46.2%" alt="UFO Performance Comparison" />
188
+ <img src="public/multi-turn_training.png" width="45%" alt="Multi-turn Training Process" />
189
+ </p>
190
+ <p align="center" style="font-size: 14px; color: #666;">
191
+ Left: Multi-turn (5-turn) RL significantly outperforms single-turn baseline. Right: Performance comparison with different training turns (1, 5, and 10).
192
+ </p>
193
+
194
+ **Key Findings:**
195
+ - **+14% success rate** over single-turn PPO baseline
196
+ - Benefits generalize to both multi-turn and single-turn inference
197
+ - Best results with 5-turn training; more turns yield diminishing returns
198
+
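+ For clarity, both metrics apply the same success check over 5 responses and differ only in how those responses are produced (independent samples for Pass@5, sequential attempts with feedback for Succ@5). A small sketch of the computation, assuming per-problem correctness flags are already available:
+
+ ```python
+ # Assumes each problem comes with a list of booleans marking whether each of its
+ # (up to) 5 responses was correct, regardless of how the responses were generated.
+ def solved_within_k(flags: list[bool], k: int = 5) -> bool:
+     return any(flags[:k])          # success if any of the first k responses is correct
+
+ def success_rate(all_flags: list[list[bool]], k: int = 5) -> float:
+     return sum(solved_within_k(f, k) for f in all_flags) / len(all_flags)
+ ```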
199
+ ### Effectiveness of Unary Feedback
200
+
201
+ To further investigate the role of unary feedback, we compare model performance under different feedback-availability conditions. In scenario (a), unary feedback is provided during both training and validation, while in scenario (b) it is available only during training. The results show that access to unary feedback in both phases substantially improves the validation success rate, whereas providing it solely during training yields no improvement, indicating that the benefit of unary feedback is contingent on its availability at inference time.
202
+
203
+ <p align="center"><img src="public/feedback_comparisons_side_by_side.png" width="80%" alt="Effectiveness of Unary Feedback" /></p>
204
+ <p align="center" style="font-size: 14px; color: #666;">
205
+ Success rate comparison under different unary feedback settings: (a) feedback in both training and validation; (b) feedback only in training.
206
+ </p>
207
+
208
+ **Key Insights:**
209
+ - Feedback in both training and validation is crucial for improvement
210
+ - Feedback only in training phase does **not** help at inference
211
+
212
+ ### Reward Design Impact
213
+
214
+ **Exponential Reward Decay:**
215
+ - Decreases the average number of actions required to solve problems by ~10%
216
+ - Encourages faster and more efficient problem solving
217
+
218
+ **Answer Diversity:**
219
+ - Non-repetitive answer ratio increases from 79.7% to 92.8%
220
+ - Multi-turn RL with UFO encourages answer diversity and strengthens robustness
221
+
222
  ## License
223
 
224
+ This model is licensed under Apache 2.0.
225
+
226
+ ## Acknowledgements
227
+
228
+ We thank the [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1) team for providing the DeepSeek-R1 model and for early conceptual inspiration. We are grateful to the [veRL](https://github.com/volcengine/verl) team for their infrastructure support and to the [RAGEN](https://github.com/RAGEN-AI/RAGEN) team for their multi-turn RL framework.