Distilled Qwen Model - Full Fine-tuning

This model was created through knowledge distillation, using Qwen/Qwen3-8B-Base as the teacher and Qwen/Qwen3-0.6B-Base as the student, with full-parameter fine-tuning of the student.

Model Details

  • Base Model: Qwen/Qwen3-0.6B-Base
  • Teacher Model: Qwen/Qwen3-8B-Base
  • Method: Knowledge Distillation with Full Fine-tuning
  • Dataset: MMLU (Massive Multitask Language Understanding)
  • Distillation Alpha: 0.7 (see the loss sketch after this list)
  • Temperature: 4.0
  • Total Parameters: ~600M (all parameters updated)
  • Format: Safetensors, F32 weights (safer to load than pickle-based PyTorch .bin checkpoints and faster to deserialize)
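
The alpha and temperature above suggest the standard blended distillation objective: a temperature-scaled KL term against the teacher's soft targets mixed with the usual hard-label cross-entropy. The exact loss used for this checkpoint is not documented in this card, so the following is a minimal sketch under that assumption; the function name distillation_loss is illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=4.0):
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # scaled by T^2 so its gradient magnitude stays comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: standard next-token cross-entropy against the labels
    # (label shifting and padding masks omitted for brevity).
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1.0 - alpha) * hard_loss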

Training Details

  • Training Samples: 285
  • Epochs: 30
  • Batch Size: 2
  • Learning Rate: 5e-05 (see the training-loop sketch after this list)
  • Final Eval Loss: N/A
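
For reference, here is how these hyperparameters might plug into a distillation training loop. The actual training script is not included in this card, so the loop below is only a sketch: it reuses the distillation_loss function sketched above, and train_dataloader stands in for a hypothetical DataLoader yielding tokenized MMLU batches (285 samples, batch size 2).

import torch
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B-Base").eval()
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

for epoch in range(30):
    # train_dataloader: hypothetical DataLoader with input_ids, attention_mask, labels
    for batch in train_dataloader:
        with torch.no_grad():  # the teacher only provides targets, no gradients
            teacher_logits = teacher(input_ids=batch["input_ids"],
                                     attention_mask=batch["attention_mask"]).logits
        student_logits = student(input_ids=batch["input_ids"],
                                 attention_mask=batch["attention_mask"]).logits
        loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()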

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the distilled model directly
tokenizer = AutoTokenizer.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")
model = AutoModelForCausalLM.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")

# Generate an answer; max_new_tokens bounds the completion length (max_length would also count the prompt)
inputs = tokenizer("Question: What is the capital of France?\nA. London\nB. Berlin\nC. Paris\nD. Madrid\nAnswer:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Technical Notes

  • Model weights are stored in safetensors format for improved security (no pickle deserialization) and faster loading
  • Loadable with any Hugging Face transformers version that supports safetensors (all recent releases do)
  • Loading is more memory-efficient than with pickle-based PyTorch .bin checkpoints; the file format does not affect inference speed

Evaluation

This model should be evaluated on multiple-choice question answering (MCQA) tasks by comparing the log-likelihood the model assigns to each answer option, as implemented in the accompanying evaluation framework.
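
The evaluation framework itself is not included in this card. As a minimal sketch of log-likelihood MCQA scoring: each answer option is appended to the question, and the option whose tokens receive the highest summed log-probability under the model is selected. The helper name score_option is illustrative, and model/tokenizer are the objects loaded in the Usage section above.

import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prompt, option):
    # Sum of log-probabilities the model assigns to the option tokens given the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so drop the last position and keep the
    # distributions that predict the option tokens.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)[:, -option_ids.size(-1):]
    token_log_probs = log_probs.gather(-1, option_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

prompt = "Question: What is the capital of France?\nAnswer:"
options = [" London", " Berlin", " Paris", " Madrid"]
scores = [score_option(model, tokenizer, prompt, o) for o in options]
print(options[max(range(len(options)), key=lambda i: scores[i])])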
