Distilled Qwen Model - Full Fine-tuning

This model was created through knowledge distillation, using Qwen/Qwen3-8B-Base as the teacher and Qwen/Qwen3-0.6B-Base as the student, with full-parameter fine-tuning of the student.

Model Details

  • Base Model: Qwen/Qwen3-0.6B-Base
  • Teacher Model: Qwen/Qwen3-8B-Base
  • Method: Knowledge Distillation with Full Fine-tuning
  • Dataset: MMLU (Massive Multitask Language Understanding)
  • Distillation Alpha: 0.7 (see the loss sketch after this list)
  • Temperature: 4.0
  • Total Parameters: ~600M (all parameters updated)
  • Format: Safetensors, F32 weights (safer to load than pickle-based PyTorch .bin checkpoints and faster to deserialize)
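
The alpha and temperature above suggest the standard blended distillation objective: a temperature-scaled KL term against the teacher's soft targets mixed with the usual hard-label cross-entropy. The exact loss used for this checkpoint is not documented in this card, so the following is a minimal sketch under that assumption; the function name distillation_loss is illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=4.0):
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # scaled by T^2 so its gradient magnitude stays comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: standard next-token cross-entropy against the labels
    # (label shifting and padding masks omitted for brevity).
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1.0 - alpha) * hard_loss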

Training Details

  • Training Samples: 285
  • Epochs: 30
  • Batch Size: 2
  • Learning Rate: 5e-05 (see the training-loop sketch after this list)
  • Final Eval Loss: N/A
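
For reference, here is how these hyperparameters might plug into a distillation training loop. The actual training script is not included in this card, so the loop below is only a sketch: it reuses the distillation_loss function sketched above, and train_dataloader stands in for a hypothetical DataLoader yielding tokenized MMLU batches (285 samples, batch size 2).

import torch
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B-Base").eval()
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

for epoch in range(30):
    # train_dataloader: hypothetical DataLoader with input_ids, attention_mask, labels
    for batch in train_dataloader:
        with torch.no_grad():  # the teacher only provides targets, no gradients
            teacher_logits = teacher(input_ids=batch["input_ids"],
                                     attention_mask=batch["attention_mask"]).logits
        student_logits = student(input_ids=batch["input_ids"],
                                 attention_mask=batch["attention_mask"]).logits
        loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()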

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the distilled model directly
tokenizer = AutoTokenizer.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")
model = AutoModelForCausalLM.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")

# Generate an answer; max_new_tokens bounds the completion length (max_length would also count the prompt)
inputs = tokenizer("Question: What is the capital of France?\nA. London\nB. Berlin\nC. Paris\nD. Madrid\nAnswer:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Technical Notes

  • Model weights are stored in safetensors format for improved security (no pickle deserialization) and faster loading
  • Loadable with any Hugging Face transformers version that supports safetensors (all recent releases do)
  • Loading is more memory-efficient than with pickle-based PyTorch .bin checkpoints; the file format does not affect inference speed

Evaluation

This model should be evaluated on multiple-choice question answering (MCQA) tasks by comparing the log-likelihood the model assigns to each answer option, as implemented in the accompanying evaluation framework.
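
The evaluation framework itself is not included in this card. As a minimal sketch of log-likelihood MCQA scoring: each answer option is appended to the question, and the option whose tokens receive the highest summed log-probability under the model is selected. The helper name score_option is illustrative, and model/tokenizer are the objects loaded in the Usage section above.

import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prompt, option):
    # Sum of log-probabilities the model assigns to the option tokens given the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so drop the last position and keep the
    # distributions that predict the option tokens.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)[:, -option_ids.size(-1):]
    token_log_probs = log_probs.gather(-1, option_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

prompt = "Question: What is the capital of France?\nAnswer:"
options = [" London", " Berlin", " Paris", " Madrid"]
scores = [score_option(model, tokenizer, prompt, o) for o in options]
print(options[max(range(len(options)), key=lambda i: scores[i])])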
