Dan-AiTuning
/

calculator_agent_qwen2.5_0.5b

+---
+base_model:
+- Qwen/Qwen2.5-0.5B-Instruct
+tags:
+- agent
+- grpo
+- multi-turn-rl
+---
+# Qwen 2.5 0.5B – Calculator Agent
+This is a fine-tuned version of [Qwen 2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) trained to use a calculator tool through multi-turn reinforcement learning with GRPO.
+A much more performant 3B model was also trained and can be found here.
+**[This Github repo](https://github.com/Danau5tin/calculator_agent_rl) shows in depth training run process details**
+---
+## 🔧 Model Description
+The Qwen 2.5 0.5B model was adapted to interface with a recursive calculator environment that supports addition, subtraction, multiplication, and division.
+The agent generates structured tool calls in XML and YAML format, which are then executed by the calculator.
+After receiving the computed result from the tool, it formulates a final human-readable response.
+---
+## ✅ Key Achievements
+- **Training Method**: GRPO, using a hybrid reward signal combining LLM-as-a-judge feedback and programmatic verification.
+- **Evaluation Accuracy**:
+  - Before RL: **0.6%**
+  - After RL: **34%**
+  - **Absolute Gain: +33.4 pts**
+- **Training Cost**: ~$18 (~£13.47) on 8x RTX6000 Ada GPUs
+- **Total Training Time**: ~3 hours
+---
+## 🧪 Evaluation Dataset
+The evaluation dataset consists of synthetically generated arithmetic problems designed to be difficult for humans to solve without a calculator. Questions include nested operations and real-world phrasing diversity.
+[Download the eval dataset](https://github.com/Danau5tin/agentic_environments/blob/qwen/examples/calculator_agent/datasets/basic_calculations_eval.csv)
+---
+## 🛠️ Usage Instructions
+### Requirements
+- Transformers or vLLM for inference
+- Flash Attention recommended for speed
+- For training/RL: see full setup in [GitHub repo](https://github.com/Dan-AiTuning/calculator_agent_rl)
+### Example Input:
+```text
+What's the sum of 987 times 654, and 987 divided by the total of 321 and 11?
+```
+### Expected Output:
+```xml
+<calculator>
+operation: add
+operands:
+  - operation: multiply
+    operands:
+      - 987
+      - 654
+  - operation: divide
+    operands:
+      - 987
+      - operation: add
+        operands:
+          - 321
+          - 11
+</calculator>
+```
+This output must be passed to the environment to be parsed & calculated. Example in python [here](https://github.com/Danau5tin/calculator_agent_rl/tree/main/src/environment/)
+The output from the environment should be provided to model as:
+```xml
+<output>
+{tool output}
+</output>
+```
+Then the model will generate it's final respoonse:
+```text
+The result of the calculation is 645,500.97
+```
+---
+## 📬 License and Attribution
+- Base model: [Qwen 2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
+- Fine-tuned by: Dan Austin
+- Repository: [GitHub Project](https://github.com/Dan-AiTuning/calculator_agent_rl)
+## 🧠 Training Framework Acknowledgement
+This model was trained using parts of the [Verifiers](https://github.com/willccbb/verifiers) framework for structured reinforcement learning. If you use this model or build upon this work, please consider citing:
+@article{brown2025verifiers,
+  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
+  author={Brown, William},
+  year={2025}
+}