# Qwen 2.5 3B Calculator Agent
This is a fine-tuned version of Qwen 2.5 3B Instruct trained to use a calculator tool through multi-turn reinforcement learning with GRPO.
A lighter 0.5B model was also trained and can be found here.
The GitHub repo documents the training run in detail.
## Model Description
The Qwen 2.5 3B model has been enhanced to interact with a recursive calculator environment that supports the four basic arithmetic operations. The agent generates structured tool calls as YAML wrapped in XML-style `<calculator>` tags, enabling precise execution of complex nested expressions. After the environment performs the calculation, the model formulates a final human-readable answer.
## Key Achievements
- Training Method: GRPO, using a hybrid reward signal combining LLM-as-a-judge feedback (Claude-3.5-Haiku) and programmatic verification; a rough sketch of this hybrid reward follows the list.
- Evaluation Accuracy:
  - Before RL: 27%
  - After RL: 89%
  - Absolute gain: +62 points
- Training Cost: $23.50 (£17.55) on 4x A100 (80GB) GPUs
- Total Training Time: ~3 hours
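The exact reward function lives in the training repo; the sketch below only illustrates the hybrid idea under stated assumptions. `judge_score` stands in for a Claude-3.5-Haiku judgment in [0, 1], and the equal weighting is illustrative, not the trained configuration.

```python
import re

def programmatic_reward(completion: str, target: float) -> float:
    """1.0 if the last number in the completion matches the target, else 0.0."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    if not numbers:
        return 0.0
    try:
        value = float(numbers[-1].replace(",", ""))
    except ValueError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

def hybrid_reward(completion: str, target: float, judge_score: float) -> float:
    # judge_score is a hypothetical stand-in for the LLM-as-a-judge call;
    # the 50/50 weighting is an assumption, not the trained setup.
    return 0.5 * programmatic_reward(completion, target) + 0.5 * judge_score
```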
## Evaluation Dataset
The evaluation dataset consists of synthetically generated arithmetic problems designed to be difficult for humans to solve without a calculator. Questions include nested operations and diverse, real-world phrasing.
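As a toy illustration of this kind of generator (not the actual pipeline used for the dataset), one can sample nested expression trees and render them with simple English templates:

```python
import operator
import random

OPS = {
    "add": (operator.add, "the sum of {} and {}"),
    "subtract": (operator.sub, "{} minus {}"),
    "multiply": (operator.mul, "the product of {} and {}"),
    "divide": (operator.truediv, "the quotient of {} divided by {}"),
}

def random_expression(depth: int = 2):
    """Return (value, English phrasing) for a random nested expression."""
    if depth == 0:
        n = random.randint(2, 999)
        return n, str(n)
    left_val, left_txt = random_expression(depth - 1)
    right_val, right_txt = random_expression(depth - 1)
    # Skip division when the right operand is zero.
    op = random.choice([k for k in OPS if k != "divide" or right_val != 0])
    fn, template = OPS[op]
    return fn(left_val, right_val), template.format(left_txt, right_txt)

value, phrasing = random_expression()
print(f"Calculate {phrasing}.", "->", value)
```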
## Usage Instructions
### Requirements
- vLLM or a Transformers pipeline (a minimal Transformers sketch follows this list)
- Flash Attention recommended for speed
- For training/RL: see the full setup in the GitHub repo
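A minimal inference sketch using Transformers; the Hub model ID below is a placeholder for this card's actual repository, and the generation settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen2.5-3b-calculator-agent"  # hypothetical ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content":
             "Find the product of 876 and 543, subtract the quotient "
             "of 876 divided by 12, and tell me the result."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```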
Example Input:

```
Find the product of 876 and 543, subtract the quotient of 876 divided by 12, and tell me the result.
```
Expected Output:

```
<calculator>
operation: subtract
operands:
  - operation: multiply
    operands:
      - 876
      - 543
  - operation: divide
    operands:
      - 876
      - 12
</calculator>
```
This block must be passed to the environment to be parsed and calculated. A Python example is available here.
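For reference, a minimal sketch of that parse-and-evaluate step, assuming PyYAML and the nested operand format shown above (the environment in the GitHub repo is the authoritative implementation):

```python
import operator
import re
import yaml

OPS = {"add": operator.add, "subtract": operator.sub,
       "multiply": operator.mul, "divide": operator.truediv}

def evaluate(node):
    """Recursively evaluate a parsed expression tree."""
    if isinstance(node, (int, float)):
        return node
    args = [evaluate(child) for child in node["operands"]]
    result = args[0]
    for arg in args[1:]:
        result = OPS[node["operation"]](result, arg)
    return result

def run_calculator(text: str) -> float:
    """Extract the <calculator> block from model output and evaluate it."""
    match = re.search(r"<calculator>(.*?)</calculator>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <calculator> block found")
    return evaluate(yaml.safe_load(match.group(1)))
```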
The output from the environment is then provided back to the model as:

```
<output>
{tool output}
</output>
```
The model then generates its final response:
The final result of the calculation is 475,595.
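End to end, the tool loop has roughly the following shape. Both callables are assumptions for illustration: `generate_reply(messages)` runs the fine-tuned model on a chat history (e.g. via the Transformers snippet above), and `run_calculator(text)` is the parse-and-evaluate step sketched earlier. Whether the tool output goes back as a user turn or a dedicated tool role depends on the chat template used in training.

```python
def agent_turn(question, generate_reply, run_calculator):
    """One full tool-use turn: tool call -> execution -> final answer.

    generate_reply and run_calculator are hypothetical callables; see the
    sketches above for what they might look like.
    """
    messages = [{"role": "user", "content": question}]
    tool_call = generate_reply(messages)          # model emits the <calculator> block
    result = run_calculator(tool_call)            # environment evaluates it
    messages.append({"role": "assistant", "content": tool_call})
    messages.append({"role": "user", "content": f"<output>\n{result}\n</output>"})
    return generate_reply(messages)               # final human-readable answer
```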
## License and Attribution
- Base model: Qwen 2.5 3B Instruct
- Fine-tuned by: Dan Austin
- Repository: GitHub Project
## Training Framework Acknowledgement
This model was trained using parts of the Verifiers framework for structured reinforcement learning. If you use this model or build upon this work, please consider citing:
```bibtex
@article{brown2025verifiers,
  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
  author={Brown, William},
  year={2025}
}
```