# Qwen3-4B-I-1509

## Model Overview

- Base Model: Qwen3-4B-Instruct-2507
- Training Method: Reinforcement Learning (GRPO) with multiple reward functions
This model (Qwen3-4B-I-1509) is fine-tuned for tool use and function-call generation.
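A minimal inference sketch is shown below. It assumes standard `transformers` tool-calling via the chat template; the `get_weather` tool is a hypothetical example used only to illustrate the format, not part of this model card.

```python
# Minimal tool-calling sketch (assumption: standard `transformers` inference;
# the `get_weather` tool below is hypothetical and only illustrates the format).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beyoru/Qwen3-4B-I-1509"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return f"Sunny in {city}"

messages = [{"role": "user", "content": "What's the weather in Hanoi?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],          # the chat template converts the function to a JSON schema
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```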
## Reward Functions

The model was trained with multi-signal rewards:
- **Rule-based Reward**: checks the correctness of the function-call name and arguments, with partial credit for matching subsets of arguments.
- **Self-Certainty Reward**: encourages confident predictions.
- **Tool-Call Reward**: validates the structural correctness of the emitted tool call.
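As a rough illustration of the rule-based and tool-call rewards, the sketch below scores a generated call against a reference call. The JSON tool-call format and the partial-credit scheme are assumptions for clarity, not the exact training code.

```python
# Illustrative reward sketch: structural validity plus name/argument matching
# with partial credit. The exact format and weights used in training are not
# published; everything here is an assumption.
import json

def tool_call_reward(completion: str, expected: dict) -> float:
    """expected example: {"name": "get_weather", "arguments": {"city": "Hanoi"}}"""
    try:
        call = json.loads(completion)              # structural check: must parse as a JSON tool call
    except json.JSONDecodeError:
        return 0.0

    if call.get("name") != expected["name"]:       # wrong function name: no credit
        return 0.0

    ref_args = expected.get("arguments", {})
    gen_args = call.get("arguments", {})
    if not ref_args:
        return 1.0

    # Partial credit for the subset of reference arguments reproduced exactly.
    matched = sum(1 for k, v in ref_args.items() if gen_args.get(k) == v)
    return matched / len(ref_args)
```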
## Training Configuration

- Optimizer: AdamW
- Learning Rate: 5e-6 with cosine decay (`min_lr_rate=0.1`)
- Scheduler: `cosine_with_min_lr`
- Generations per Prompt: 4
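For reference, these hyperparameters map onto TRL's `GRPOConfig` roughly as in the sketch below; any field not listed above (output directory, batch sizes, and so on) is a placeholder assumption.

```python
# Rough mapping of the listed hyperparameters onto TRL's GRPOConfig.
# Only the values stated above come from the model card; the rest are placeholders.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen3-4b-i-1509-grpo",            # placeholder
    optim="adamw_torch",                          # AdamW
    learning_rate=5e-6,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    num_generations=4,                            # generations sampled per prompt
)
```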
## Eval Results

Important notes:

- **Why are the scores lower than in the technical report?** Hardware limits required reducing the maximum number of tokens during evaluation for both models.
- **Is the evaluation fair?** The same configuration is used for every model I evaluate, whether larger than or the same size as this one.
### Tau-Bench

| Model | Airline | Retail |
|---|---|---|
| Qwen3-4B-I-1509 | 0.2800 | 0.2783 |
| Base Model | 0.3000 | 0.2261 |
### ACEBench

| Model | Overall Accuracy |
|---|---|
| Qwen3-4B-I-1509 | 0.677 |
| Qwen3-4B-Instruct-2507 (base) | 0.635 |
| Salesforce/Llama-xLAM-2-8b-fc-r | 0.5792 |
More results will be added soon.

## Contributing

Contributions to this model are welcome, as is feedback on its performance and quality.

Support me at:
## Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{qwen3-4b-i-1509,
  title        = {Qwen3-4B-I-1509: Fine-tuned Qwen3-4B-Instruct with GRPO for Tool-Use and Function Calling},
  author       = {Beyoru},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/beyoru/Qwen3-4B-I-1509}}
}
```