|
---
license: apache-2.0
tags:
- reasoning
- mathematics
- reinforcement-learning
datasets:
- AIME
- AMC
- Omni-Math
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
---
|
|
|
# ALP_R1_Qwen1.5B |
|
|
|
DeepSeek-R1-Distill-Qwen-1.5B trained with an Adaptive Length Penalty (ALP), which reduces token usage by roughly 50% while maintaining performance.
|
|
|
## Training |
|
- 100 GRPO steps, batch size 512, learning rate 1e-6, KL coefficient β = 1e-7 (see the config sketch below)

- 16 rollouts per prompt for difficulty estimation

- 8K context window
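

The card does not name the training stack; as a hypothetical illustration, the hyperparameters above might map onto TRL's `GRPOConfig` roughly as follows (the parameter names are TRL's, not confirmed by this card):

```python
# Hypothetical sketch: mapping the reported hyperparameters onto TRL's GRPOConfig.
# The card does not specify the training framework, so treat this as illustrative only.
from trl import GRPOConfig

config = GRPOConfig(
    max_steps=100,                    # 100 GRPO steps
    per_device_train_batch_size=512,  # batch of 512 (device sharding not specified)
    learning_rate=1e-6,
    beta=1e-7,                        # KL penalty coefficient β
    num_generations=16,               # 16 rollouts per prompt for difficulty estimation
    max_completion_length=8192,       # 8K context window
)
```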
|
|
|
## Performance (Pass@1) |
|
- MATH-500: 0.81 |
|
- AIME: 0.252 |
|
- OlympiadBench: 0.51 |
|
|
|
## Token Usage (average tokens per response, before → after ALP)

- MATH-500: 2804 → 862 (-69%)

- AIME: 4007 → 3331 (-17%)

- OlympiadBench: 3606 → 2107 (-42%)
|
|
|
## Usage |
|
```python
# Example problem; the model expects the boxed-answer instruction it was trained with.
problem = "What is the sum of the first 100 positive integers?"
prompt = f"{problem} Let's think step by step and output the final answer within \\boxed{{}}."
```
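

A minimal end-to-end inference sketch with Hugging Face Transformers; the repo id below is a placeholder, so substitute this model's actual path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ALP_R1_Qwen1.5B"  # placeholder repo id; replace with the actual model path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

problem = "What is the sum of the first 100 positive integers?"
prompt = f"{problem} Let's think step by step and output the final answer within \\boxed{{}}."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, do_sample=True, temperature=0.6)

# Print only the newly generated tokens (the model's reasoning and boxed answer).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```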