Model Card for Qwen3-0.6B-instruction-finetuned

This model is a fine-tuned version of unsloth/Qwen3-0.6B-Base. It has been trained using TRL.

Quick start

from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="andresnowak/Qwen3-0.6B-instruction-finetuned", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

Training procedure

This model was trained with supervised instruction finetuning using language modelling (the loss is computed on both the prompt and the completion). During training, random prompt templates were also applied to make the model more robust to how questions are phrased, on top of the dataset already being high quality and containing many such examples; this was done because chat templates were not allowed for the evaluation. Training likely had two problems. First, the dataset was not filtered to keep only examples whose combined prompt and completion fit within the 2048-token maximum length; longer examples were truncated instead. Second, the tokenizer uses left-side padding, which FlashAttention 2 requires.
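The length issue above could be avoided by dropping over-long examples instead of truncating them. A minimal sketch in plain Python, where `count_tokens` is a hypothetical stand-in for the real tokenizer's token count:

```python
MAX_LEN = 2048  # max sequence length used during training

def count_tokens(text: str) -> int:
    # Hypothetical stand-in for len(tokenizer(text)["input_ids"]);
    # approximated with whitespace splitting for illustration only.
    return len(text.split())

def filter_by_length(examples, max_len=MAX_LEN):
    # Keep only examples whose prompt + completion fit within max_len,
    # rather than silently truncating the completion.
    return [
        ex for ex in examples
        if count_tokens(ex["prompt"]) + count_tokens(ex["completion"]) <= max_len
    ]

examples = [
    {"prompt": "short question", "completion": "short answer"},
    {"prompt": "a " * 1500, "completion": "b " * 1000},  # 2500 tokens combined
]
kept = filter_by_length(examples)
print(len(kept))  # 1
```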


environment:
  seed: 42
  use_template: True

model:
  name: Qwen/Qwen3-0.6B-Base
  hub_model_id: andresnowak/Qwen3-0.6B-instruction-finetuned

dataset:
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: codeAlpaca
    size: 0.3
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: noRobots
    size: 0.8
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: openMathGsm8k
    size: 0.3
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: codeV2
    size: 0.3
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: flanV2
    size: 0.8
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: ifData
    size: 0.8
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: mathAlgebra 
    size: 0.3
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: mathGrade
    size: 0.3
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: oasst1
    size: 0.6
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: sciriff
    size: 0.8
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: tableGpt
    size: 0.3
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: tirMath
    size: 0.4
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: wildChat
    size: 0.7
  - name: andresnowak/Instruction-finetuning-mixture-mnlp
    config: mathV5
    size: 0.2
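
The `size` field on each dataset entry reads as a sampling fraction. A minimal sketch of how such a mixture could be subsampled — the fraction interpretation is an assumption, and `random.sample` stands in for whatever the real data pipeline uses:

```python
import random

def subsample(rows, fraction, seed=42):
    # Draw a reproducible random subset of round(fraction * n) rows.
    rng = random.Random(seed)
    k = round(fraction * len(rows))
    return rng.sample(rows, k)

# Stand-in for one config's examples (e.g. codeAlpaca with size: 0.3).
rows = list(range(100))
picked = subsample(rows, 0.3)
print(len(picked))  # 30
```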

dataset_evaluation:
  - name: cais/mmlu
    config: validation
    subjects: ["abstract_algebra", "anatomy", "astronomy", "college_biology", "college_chemistry", "college_computer_science", "college_mathematics", "college_physics", "computer_security", "conceptual_physics", "electrical_engineering", "elementary_mathematics", "high_school_biology", "high_school_chemistry", "high_school_computer_science", "high_school_mathematics", "high_school_physics", "high_school_statistics", "machine_learning"]

training:
  learning_rate: 1e-5
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 8
  num_train_epochs: 2
  weight_decay: 0.00
  warmup_ratio: 0.03
  max_grad_norm: 0.5
  lr_scheduler: "linear"

This model was trained with SFT.
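The hyperparameters above imply an effective batch size of per_device_train_batch_size × gradient_accumulation_steps (× number of GPUs). A quick sanity check, assuming a single GPU and a hypothetical dataset size (neither is stated in the card):

```python
per_device_train_batch_size = 16
gradient_accumulation_steps = 8
num_train_epochs = 2
warmup_ratio = 0.03
num_gpus = 1  # assumption; not stated in the card

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 128

num_examples = 100_000  # hypothetical, purely for illustration
steps_per_epoch = num_examples // effective_batch
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = int(total_steps * warmup_ratio)
print(total_steps, warmup_steps)  # 1562 46
```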

Evaluation results

The performance is as follows:

Benchmark        Accuracy (Acc)   Normalized Accuracy (Acc Norm)
ARC Challenge    46.0%            45.3%
ARC Easy         59.3%            54.2%
GPQA             29.9%            27.0%
Math QA          24.0%            24.8%
MCQA Evals       37.9%            34.9%
MMLU             47.2%            47.2%
MMLU Pro         13.2%            12.0%
MuSR             43.5%            42.1%
NLP4Education    38.8%            36.5%
Overall          37.8%            36.0%

The tests were done with the following prompt (only MuSR used a different one, which adds "Narrative:" and "Question:" sections):

This question assesses challenging STEM problems as found on graduate standardized tests. Carefully evaluate the options and select the correct answer.

---
[Insert Question Here]
---
[Insert Choices Here, e.g.:
A. Option 1
B. Option 2
C. Option 3
D. Option 4]
---

Your response should include the letter and the exact text of the correct choice.
Example: B. Entropy increases.
Answer:

Testing was scored against answers in the form [Letter]. [Text answer].
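The evaluation prompt above can be assembled programmatically. A small sketch — the function name and structure are my own, not from the actual evaluation harness:

```python
TEMPLATE = """This question assesses challenging STEM problems as found on graduate standardized tests. Carefully evaluate the options and select the correct answer.

---
{question}
---
{choices}
---

Your response should include the letter and the exact text of the correct choice.
Example: B. Entropy increases.
Answer:"""

def build_prompt(question, options):
    # Label the options A., B., C., ... and fill in the template.
    letters = "ABCDEFGH"
    choices = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return TEMPLATE.format(question=question, choices=choices)

prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
print(prompt)
```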

Framework versions

  • TRL: 0.15.2
  • Transformers: 4.51.3
  • PyTorch: 2.5.1+cu121
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citations

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}