Model information

  • Fine-tuned from: unsloth/Qwen3-4B-unsloth-bnb-4bit
  • Datasets: interstellarninja/tool-calls-single-reasoning, mlabonne/FineTome-100k, unsloth/OpenMathReasoning-mini

Eval

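  • Fine-tuning improved the OVERALL score on all three benchmarks (before --> after):
  • gpqa (AveragePass@1, 150 samples): 0.24 --> 0.3333
  • gsm8k (AverageAccuracy, 50 samples): 0.50 --> 0.68
  • tool_bench (Act.EM, 90 samples before, 88 after): 0.1667 --> 0.3182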
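
To make the size of these changes easier to compare, the sketch below computes the absolute and relative gain for each benchmark. The scores are copied from the summary above; the `scores` dictionary and the `gains` helper are illustrative names, not part of the evaluation tooling.

```python
# Before/after OVERALL scores copied from the summary above.
scores = {
    "gpqa": (0.24, 0.3333),
    "gsm8k": (0.50, 0.68),
    "tool_bench": (0.1667, 0.3182),
}

def gains(before, after):
    """Return (absolute gain, relative gain) for one benchmark (hypothetical helper)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative

for name, (before, after) in scores.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.4f} absolute, +{relative:.1%} relative")
```

In relative terms this is roughly +39% on gpqa, +36% on gsm8k, and +91% on tool_bench. Each figure rests on at most 150 samples.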

Before fine-tuning (unsloth/Qwen3-4B-unsloth-bnb-4bit)

  • +---------+------------+-----------------+---------------+-------+---------+---------+
    | Model   | Dataset    | Metric          | Subset        |   Num |   Score | Cat.0   |
    +=========+============+=================+===============+=======+=========+=========+
    | qwen    | gpqa       | AveragePass@1   | gpqa_extended |    50 |  0.24   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | gpqa       | AveragePass@1   | gpqa_main     |    50 |  0.26   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | gpqa       | AveragePass@1   | gpqa_diamond  |    50 |  0.22   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | gpqa       | AveragePass@1   | OVERALL       |   150 |  0.24   | -       |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | gsm8k      | AverageAccuracy | main          |    50 |  0.5    | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | tool_bench | Act.EM          | in_domain     |    42 |  0.1667 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | tool_bench | Act.EM          | out_of_domain |    48 |  0.1667 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | qwen    | tool_bench | Act.EM          | OVERALL       |    90 |  0.1667 | -       |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    

After fine-tuning (this model)

  • +---------+------------+-----------------+---------------+-------+---------+---------+
    | Model   | Dataset    | Metric          | Subset        |   Num |   Score | Cat.0   |
    +=========+============+=================+===============+=======+=========+=========+
    | model   | gpqa       | AveragePass@1   | gpqa_extended |    50 |  0.26   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gpqa       | AveragePass@1   | gpqa_main     |    50 |  0.36   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gpqa       | AveragePass@1   | gpqa_diamond  |    50 |  0.38   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gpqa       | AveragePass@1   | OVERALL       |   150 |  0.3333 | -       |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gsm8k      | AverageAccuracy | main          |    50 |  0.68   | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | tool_bench | Act.EM          | in_domain     |    41 |  0.3171 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | tool_bench | Act.EM          | out_of_domain |    47 |  0.3191 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | tool_bench | Act.EM          | OVERALL       |    88 |  0.3182 | -       |
    +---------+------------+-----------------+---------------+-------+---------+---------+
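
The OVERALL rows appear consistent with a sample-weighted mean of the subset scores. That is an assumption inferred from the numbers, not something the evaluation output states. A minimal check using the (Num, Score) pairs from the table above, with `weighted_overall` as a hypothetical helper name:

```python
def weighted_overall(subsets):
    """Sample-weighted mean of (num_samples, score) pairs.

    Assumed aggregation rule; the evaluation tool may compute OVERALL differently.
    """
    total = sum(num for num, _ in subsets)
    return sum(num * score for num, score in subsets) / total

# (Num, Score) rows copied from the after-fine-tuning table.
tool_bench = [(41, 0.3171), (47, 0.3191)]
gpqa = [(50, 0.26), (50, 0.36), (50, 0.38)]

print(round(weighted_overall(tool_bench), 4))  # 0.3182, matches the OVERALL row
print(round(weighted_overall(gpqa), 4))        # 0.3333, matches the OVERALL row
```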
    

Qwen/Qwen3-4B model

  • +---------+------------+-----------------+---------------+-------+---------+---------+
    | Model   | Dataset    | Metric          | Subset        |   Num |   Score | Cat.0   |
    +=========+============+=================+===============+=======+=========+=========+
    | model   | gpqa       | AveragePass@1   | gpqa_extended |    50 |    0.32 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gpqa       | AveragePass@1   | gpqa_main     |    50 |    0.22 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gpqa       | AveragePass@1   | gpqa_diamond  |    50 |    0.18 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gpqa       | AveragePass@1   | OVERALL       |   150 |    0.24 | -       |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | gsm8k      | AverageAccuracy | main          |    50 |    0.48 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+ 
    | model   | tool_bench | Act.EM          | in_domain     |    43 |  0.1628 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | tool_bench | Act.EM          | out_of_domain |    47 |  0.1702 | default |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    | model   | tool_bench | Act.EM          | OVERALL       |    90 |  0.1667 | -       |
    +---------+------------+-----------------+---------------+-------+---------+---------+
    
Model size: 4.02B params (Safetensors); tensor types: BF16, F16