# Qwen 2.5 0.5B Calculator Agent
This is a fine-tuned version of Qwen 2.5 0.5B Instruct trained to use a calculator tool through multi-turn reinforcement learning with GRPO.
A much more performant 3B model was also trained and can be found here.
The GitHub repo documents the training run process in depth.
## Model Description
The Qwen 2.5 0.5B model was adapted to interface with a recursive calculator environment that supports addition, subtraction, multiplication, and division. The agent emits structured tool calls as YAML wrapped in XML tags, which the environment parses and executes. After receiving the computed result from the tool, the model formulates a final human-readable response.
## Key Achievements
- Training Method: GRPO, using a hybrid reward signal that combines LLM-as-a-judge feedback with programmatic verification (a minimal sketch of the programmatic check follows this list).
- Evaluation Accuracy:
  - Before RL: 0.6%
  - After RL: 34%
  - Absolute Gain: +33.4 pts
- Training Cost: $18 (£13.47) on 8x RTX 6000 Ada GPUs
- Total Training Time: ~3 hours
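The actual reward implementation lives in the linked repo; purely as an illustrative sketch, the programmatic-verification half might resemble the check below (the function name, answer-extraction regex, and tolerance are assumptions, not the project's code):

```python
import re

def verification_reward(completion: str, expected: float, tol: float = 1e-2) -> float:
    """Hypothetical programmatic check: reward 1.0 when the last number in
    the completion matches the ground-truth value within a tolerance.
    The real signal also blends in LLM-as-a-judge feedback."""
    # Pull numbers out of the completion, allowing thousands separators
    # like "645,500.97".
    numbers = re.findall(r"-?[\d,]+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    try:
        answer = float(numbers[-1].replace(",", ""))
    except ValueError:
        return 0.0
    return 1.0 if abs(answer - expected) <= tol else 0.0
```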
## Evaluation Dataset
The evaluation dataset consists of synthetically generated arithmetic problems designed to be difficult for humans to solve without a calculator. Questions include nested operations and diverse real-world phrasing.
## Usage Instructions

### Requirements
- Transformers or vLLM for inference
- Flash Attention recommended for speed
- For training/RL: see the full setup in the GitHub repo
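A minimal inference sketch with `transformers` (the repo id below is a placeholder for this model card's id, and the chat-template usage follows the base Qwen instruct model; both are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-model"  # replace with this model card's repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [
    {"role": "user", "content": "What's the sum of 987 times 654, and 987 "
                                "divided by the total of 321 and 11?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model should first emit a <calculator>...</calculator> tool call.
output_ids = model.generate(inputs, max_new_tokens=256)
tool_call_text = tokenizer.decode(
    output_ids[0][inputs.shape[-1]:], skip_special_tokens=True
)
print(tool_call_text)
```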
Example Input:

```
What's the sum of 987 times 654, and 987 divided by the total of 321 and 11?
```
Expected Output:

```
<calculator>
operation: add
operands:
  - operation: multiply
    operands:
      - 987
      - 654
  - operation: divide
    operands:
      - 987
      - operation: add
        operands:
          - 321
          - 11
</calculator>
```
This output must be passed to the environment, which parses and evaluates it. A Python example is available here.
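For reference, a minimal evaluator sketch assuming PyYAML and the operation names shown above (the actual environment in the repo may differ):

```python
import math
import re
import yaml  # pip install pyyaml

def evaluate(node):
    """Recursively evaluate a parsed node: either a bare number or an
    {operation, operands} mapping."""
    if isinstance(node, (int, float)):
        return node
    operands = [evaluate(child) for child in node["operands"]]
    op = node["operation"]
    if op == "add":
        return sum(operands)
    if op == "subtract":
        return operands[0] - sum(operands[1:])
    if op == "multiply":
        return math.prod(operands)
    if op == "divide":
        result = operands[0]
        for divisor in operands[1:]:
            result /= divisor
        return result
    raise ValueError(f"unknown operation: {op}")

def run_calculator(completion: str) -> float:
    """Extract the YAML body of the <calculator> tag and evaluate it."""
    match = re.search(r"<calculator>(.*?)</calculator>", completion, re.DOTALL)
    if match is None:
        raise ValueError("no <calculator> call found")
    return evaluate(yaml.safe_load(match.group(1)))
```

On the example above, `round(run_calculator(tool_call_text), 2)` evaluates 987 × 654 + 987 / (321 + 11) and returns 645500.97.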
The output from the environment should then be provided back to the model as:

```
<output>
{tool output}
</output>
```
The model then generates its final response:

```
The result of the calculation is 645,500.97
```
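Putting the pieces together, the multi-turn loop looks roughly like the sketch below, reusing the hypothetical `run_calculator` and the variables from the inference sketch. Feeding the tool result back as a user turn is an assumption; check the GitHub repo for the exact chat format used in training:

```python
# Append the tool call and its computed result, then generate the final answer.
result = run_calculator(tool_call_text)
messages.append({"role": "assistant", "content": tool_call_text})
messages.append({"role": "user", "content": f"<output>\n{result}\n</output>"})

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
final_ids = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(final_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```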
## License and Attribution
- Base model: Qwen 2.5 0.5B Instruct
- Fine-tuned by: Dan Austin
- Repository: GitHub Project
## Training Framework Acknowledgement
This model was trained using parts of the Verifiers framework for structured reinforcement learning. If you use this model or build upon this work, please consider citing:
```bibtex
@article{brown2025verifiers,
  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
  author={Brown, William},
  year={2025}
}
```