# Qwen 2.5 3B Calculator Agent
This is a fine-tuned version of Qwen 2.5 3B Instruct trained to use a calculator tool through multi-turn reinforcement learning with GRPO.
A lighter 0.5B model was also trained and can be found here.
The GitHub repo documents the training run in detail.
## Model Description
The Qwen 2.5 3B model has been enhanced to interact with a recursive calculator environment that supports the four basic arithmetic operations. The agent generates structured tool calls as YAML wrapped in XML-style `<calculator>` tags, enabling precise execution of complex nested expressions. After the environment performs the calculation, the model formulates a final human-readable answer.
## Key Achievements
- Training Method: GRPO, using a hybrid reward signal combining LLM-as-a-judge feedback (Claude-3.5-Haiku) and programmatic verification; a rough sketch of this hybrid reward follows the list.
- Evaluation Accuracy:
  - Before RL: 27%
  - After RL: 89%
  - Absolute gain: +62 points
- Training Cost: $23.50 (£17.55) on 4x A100 (80GB) GPUs
- Total Training Time: ~3 hours
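The exact reward function lives in the training repo; the sketch below only illustrates the hybrid idea under stated assumptions. `judge_score` stands in for a Claude-3.5-Haiku judgment in [0, 1], and the equal weighting is illustrative, not the trained configuration.

```python
import re

def programmatic_reward(completion: str, target: float) -> float:
    """1.0 if the last number in the completion matches the target, else 0.0."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    if not numbers:
        return 0.0
    try:
        value = float(numbers[-1].replace(",", ""))
    except ValueError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

def hybrid_reward(completion: str, target: float, judge_score: float) -> float:
    # judge_score is a hypothetical stand-in for the LLM-as-a-judge call;
    # the 50/50 weighting is an assumption, not the trained setup.
    return 0.5 * programmatic_reward(completion, target) + 0.5 * judge_score
```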
## Evaluation Dataset
The evaluation dataset consists of synthetically generated arithmetic problems designed to be difficult for humans to solve without a calculator. Questions include nested operations and diverse, real-world phrasing.
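As a toy illustration of this kind of generator (not the actual pipeline used for the dataset), one can sample nested expression trees and render them with simple English templates:

```python
import operator
import random

OPS = {
    "add": (operator.add, "the sum of {} and {}"),
    "subtract": (operator.sub, "{} minus {}"),
    "multiply": (operator.mul, "the product of {} and {}"),
    "divide": (operator.truediv, "the quotient of {} divided by {}"),
}

def random_expression(depth: int = 2):
    """Return (value, English phrasing) for a random nested expression."""
    if depth == 0:
        n = random.randint(2, 999)
        return n, str(n)
    left_val, left_txt = random_expression(depth - 1)
    right_val, right_txt = random_expression(depth - 1)
    # Skip division when the right operand is zero.
    op = random.choice([k for k in OPS if k != "divide" or right_val != 0])
    fn, template = OPS[op]
    return fn(left_val, right_val), template.format(left_txt, right_txt)

value, phrasing = random_expression()
print(f"Calculate {phrasing}.", "->", value)
```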
## Usage Instructions
### Requirements
- vLLM or a Transformers pipeline (a minimal Transformers sketch follows this list)
- Flash Attention recommended for speed
- For training/RL: see the full setup in the GitHub repo
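A minimal inference sketch using Transformers; the Hub model ID below is a placeholder for this card's actual repository, and the generation settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen2.5-3b-calculator-agent"  # hypothetical ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content":
             "Find the product of 876 and 543, subtract the quotient "
             "of 876 divided by 12, and tell me the result."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```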
Example Input:

```
Find the product of 876 and 543, subtract the quotient of 876 divided by 12, and tell me the result.
```
Expected Output:

```
<calculator>
operation: subtract
operands:
  - operation: multiply
    operands:
      - 876
      - 543
  - operation: divide
    operands:
      - 876
      - 12
</calculator>
```
This block must be passed to the environment to be parsed and calculated. A Python example is available here.
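For reference, a minimal sketch of that parse-and-evaluate step, assuming PyYAML and the nested operand format shown above (the environment in the GitHub repo is the authoritative implementation):

```python
import operator
import re
import yaml

OPS = {"add": operator.add, "subtract": operator.sub,
       "multiply": operator.mul, "divide": operator.truediv}

def evaluate(node):
    """Recursively evaluate a parsed expression tree."""
    if isinstance(node, (int, float)):
        return node
    args = [evaluate(child) for child in node["operands"]]
    result = args[0]
    for arg in args[1:]:
        result = OPS[node["operation"]](result, arg)
    return result

def run_calculator(text: str) -> float:
    """Extract the <calculator> block from model output and evaluate it."""
    match = re.search(r"<calculator>(.*?)</calculator>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <calculator> block found")
    return evaluate(yaml.safe_load(match.group(1)))
```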
The output from the environment is then provided back to the model as:

```
<output>
{tool output}
</output>
```
The model then generates its final response:
The final result of the calculation is 475,595.
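End to end, the tool loop has roughly the following shape. Both callables are assumptions for illustration: `generate_reply(messages)` runs the fine-tuned model on a chat history (e.g. via the Transformers snippet above), and `run_calculator(text)` is the parse-and-evaluate step sketched earlier. Whether the tool output goes back as a user turn or a dedicated tool role depends on the chat template used in training.

```python
def agent_turn(question, generate_reply, run_calculator):
    """One full tool-use turn: tool call -> execution -> final answer.

    generate_reply and run_calculator are hypothetical callables; see the
    sketches above for what they might look like.
    """
    messages = [{"role": "user", "content": question}]
    tool_call = generate_reply(messages)          # model emits the <calculator> block
    result = run_calculator(tool_call)            # environment evaluates it
    messages.append({"role": "assistant", "content": tool_call})
    messages.append({"role": "user", "content": f"<output>\n{result}\n</output>"})
    return generate_reply(messages)               # final human-readable answer
```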
## License and Attribution
- Base model: Qwen 2.5 3B Instruct
- Fine-tuned by: Dan Austin
- Repository: GitHub Project
## Training Framework Acknowledgement
This model was trained using parts of the Verifiers framework for structured reinforcement learning. If you use this model or build upon this work, please consider citing:
```bibtex
@article{brown2025verifiers,
  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
  author={Brown, William},
  year={2025}
}
```