Dan-AiTuning commited on
Commit
2defd08
·
verified ·
1 Parent(s): 30eef20

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +112 -0
README.md ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-0.5B-Instruct
4
+ tags:
5
+ - agent
6
+ - grpo
7
+ - multi-turn-rl
8
+ ---
9
+ # Qwen 2.5 0.5B – Calculator Agent
10
+
11
+ This is a fine-tuned version of [Qwen 2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) trained to use a calculator tool through multi-turn reinforcement learning with GRPO.
12
+
13
+ A much more performant 3B model was also trained and can be found here.
14
+
15
+ **[This Github repo](https://github.com/Danau5tin/calculator_agent_rl) shows in depth training run process details**
16
+
17
+ ---
18
+
19
+ ## 🔧 Model Description
20
+
21
+ The Qwen 2.5 0.5B model was adapted to interface with a recursive calculator environment that supports addition, subtraction, multiplication, and division.
22
+ The agent generates structured tool calls in XML and YAML format, which are then executed by the calculator.
23
+ After receiving the computed result from the tool, it formulates a final human-readable response.
24
+
25
+ ---
26
+
27
+ ## ✅ Key Achievements
28
+
29
+ - **Training Method**: GRPO, using a hybrid reward signal combining LLM-as-a-judge feedback and programmatic verification.
30
+ - **Evaluation Accuracy**:
31
+ - Before RL: **0.6%**
32
+ - After RL: **34%**
33
+ - **Absolute Gain: +33.4 pts**
34
+ - **Training Cost**: ~$18 (~£13.47) on 8x RTX6000 Ada GPUs
35
+ - **Total Training Time**: ~3 hours
36
+
37
+ ---
38
+
39
+ ## 🧪 Evaluation Dataset
40
+
41
+ The evaluation dataset consists of synthetically generated arithmetic problems designed to be difficult for humans to solve without a calculator. Questions include nested operations and real-world phrasing diversity.
42
+
43
+ [Download the eval dataset](https://github.com/Danau5tin/agentic_environments/blob/qwen/examples/calculator_agent/datasets/basic_calculations_eval.csv)
44
+
45
+ ---
46
+
47
+ ## 🛠️ Usage Instructions
48
+
49
+ ### Requirements
50
+
51
+ - Transformers or vLLM for inference
52
+ - Flash Attention recommended for speed
53
+ - For training/RL: see full setup in [GitHub repo](https://github.com/Dan-AiTuning/calculator_agent_rl)
54
+
55
+ ### Example Input:
56
+
57
+ ```text
58
+ What's the sum of 987 times 654, and 987 divided by the total of 321 and 11?
59
+ ```
60
+
61
+ ### Expected Output:
62
+
63
+ ```xml
64
+ <calculator>
65
+ operation: add
66
+ operands:
67
+ - operation: multiply
68
+ operands:
69
+ - 987
70
+ - 654
71
+ - operation: divide
72
+ operands:
73
+ - 987
74
+ - operation: add
75
+ operands:
76
+ - 321
77
+ - 11
78
+ </calculator>
79
+ ```
80
+
81
+ This output must be passed to the environment to be parsed & calculated. Example in python [here](https://github.com/Danau5tin/calculator_agent_rl/tree/main/src/environment/)
82
+
83
+ The output from the environment should be provided to model as:
84
+ ```xml
85
+ <output>
86
+ {tool output}
87
+ </output>
88
+ ```
89
+
90
+ Then the model will generate it's final respoonse:
91
+
92
+ ```text
93
+ The result of the calculation is 645,500.97
94
+ ```
95
+
96
+ ---
97
+
98
+ ## 📬 License and Attribution
99
+
100
+ - Base model: [Qwen 2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
101
+ - Fine-tuned by: Dan Austin
102
+ - Repository: [GitHub Project](https://github.com/Dan-AiTuning/calculator_agent_rl)
103
+
104
+ ## 🧠 Training Framework Acknowledgement
105
+
106
+ This model was trained using parts of the [Verifiers](https://github.com/willccbb/verifiers) framework for structured reinforcement learning. If you use this model or build upon this work, please consider citing:
107
+
108
+ @article{brown2025verifiers,
109
+ title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
110
+ author={Brown, William},
111
+ year={2025}
112
+ }