Model information

Fine-tuned from: unsloth/Qwen3-4B-unsloth-bnb-4bit
Datasets: interstellarninja/tool-calls-single-reasoning, mlabonne/FineTome-100k, unsloth/OpenMathReasoning-mini

Eval

Fine-tuning improved the overall score on all three benchmarks (before --> after):
gpqa: 0.24 --> 0.3333
gsm8k: 0.5 --> 0.68
tool_bench: 0.1667 --> 0.3182
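
To put the three changes on a common footing, the short sketch below computes the absolute and relative gain for each benchmark from the overall scores reported in the tables. The dictionary names and the helper `gains` are illustrative only and are not part of any evaluation tool.

```python
# Overall scores copied from the "before" and "after" tables in this card.
before = {"gpqa": 0.24, "gsm8k": 0.5, "tool_bench": 0.1667}
after = {"gpqa": 0.3333, "gsm8k": 0.68, "tool_bench": 0.3182}


def gains(before, after):
    """Return {benchmark: (absolute_gain, relative_gain)}."""
    result = {}
    for name, old in before.items():
        new = after[name]
        result[name] = (new - old, (new - old) / old)
    return result


summary = gains(before, after)
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute:.4f} absolute, +{relative:.1%} relative")
```

Note that each benchmark was scored on 150 items or fewer, so these gains are indicative rather than precise.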

Before fine-tuning (unsloth/Qwen3-4B-unsloth-bnb-4bit)

+---------+------------+-----------------+---------------+-------+---------+---------+
| Model   | Dataset    | Metric          | Subset        |   Num |   Score | Cat.0   |
+=========+============+=================+===============+=======+=========+=========+
| qwen    | gpqa       | AveragePass@1   | gpqa_extended |    50 |  0.24   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | gpqa       | AveragePass@1   | gpqa_main     |    50 |  0.26   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | gpqa       | AveragePass@1   | gpqa_diamond  |    50 |  0.22   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | gpqa       | AveragePass@1   | OVERALL       |   150 |  0.24   | -       |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | gsm8k      | AverageAccuracy | main          |    50 |  0.5    | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | tool_bench | Act.EM          | in_domain     |    42 |  0.1667 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | tool_bench | Act.EM          | out_of_domain |    48 |  0.1667 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| qwen    | tool_bench | Act.EM          | OVERALL       |    90 |  0.1667 | -       |
+---------+------------+-----------------+---------------+-------+---------+---------+

After fine-tuning (this model)

+---------+------------+-----------------+---------------+-------+---------+---------+
| Model   | Dataset    | Metric          | Subset        |   Num |   Score | Cat.0   |
+=========+============+=================+===============+=======+=========+=========+
| model   | gpqa       | AveragePass@1   | gpqa_extended |    50 |  0.26   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gpqa       | AveragePass@1   | gpqa_main     |    50 |  0.36   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gpqa       | AveragePass@1   | gpqa_diamond  |    50 |  0.38   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gpqa       | AveragePass@1   | OVERALL       |   150 |  0.3333 | -       |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gsm8k      | AverageAccuracy | main          |    50 |  0.68   | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | tool_bench | Act.EM          | in_domain     |    41 |  0.3171 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | tool_bench | Act.EM          | out_of_domain |    47 |  0.3191 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | tool_bench | Act.EM          | OVERALL       |    88 |  0.3182 | -       |
+---------+------------+-----------------+---------------+-------+---------+---------+
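
The OVERALL rows in this table are consistent with a sample-weighted mean of the subset scores. That is an assumption about how the evaluation tool aggregates, inferred from the numbers rather than stated in this card. A minimal sketch that reproduces the two OVERALL values from the subset rows (the function name `weighted_overall` is hypothetical):

```python
def weighted_overall(subsets):
    """Sample-weighted mean of (num_samples, score) pairs."""
    total = sum(num for num, _ in subsets)
    return sum(num * score for num, score in subsets) / total


# (Num, Score) pairs copied from the table for the fine-tuned model.
gpqa_subsets = [(50, 0.26), (50, 0.36), (50, 0.38)]
tool_bench_subsets = [(41, 0.3171), (47, 0.3191)]

gpqa_overall = weighted_overall(gpqa_subsets)
tool_bench_overall = weighted_overall(tool_bench_subsets)

print(round(gpqa_overall, 4))        # matches the gpqa OVERALL row
print(round(tool_bench_overall, 4))  # matches the tool_bench OVERALL row
```

Because the subset scores in the table are already rounded to four decimals, the recomputed values can only be expected to agree with the OVERALL rows to about that precision.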

Qwen/Qwen3-4B model

+---------+------------+-----------------+---------------+-------+---------+---------+
| Model   | Dataset    | Metric          | Subset        |   Num |   Score | Cat.0   |
+=========+============+=================+===============+=======+=========+=========+
| model   | gpqa       | AveragePass@1   | gpqa_extended |    50 |    0.32 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gpqa       | AveragePass@1   | gpqa_main     |    50 |    0.22 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gpqa       | AveragePass@1   | gpqa_diamond  |    50 |    0.18 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gpqa       | AveragePass@1   | OVERALL       |   150 |    0.24 | -       |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | gsm8k      | AverageAccuracy | main          |    50 |    0.48 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | tool_bench | Act.EM          | in_domain     |    43 |  0.1628 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | tool_bench | Act.EM          | out_of_domain |    47 |  0.1702 | default |
+---------+------------+-----------------+---------------+-------+---------+---------+
| model   | tool_bench | Act.EM          | OVERALL       |    90 |  0.1667 | -       |
+---------+------------+-----------------+---------------+-------+---------+---------+

Model: wesjos/Qwen3-4B-toolcall
