# Distilled Qwen Model - Full Fine-tuning
This model was created by distilling knowledge from Qwen/Qwen3-8B-Base (teacher) into Qwen/Qwen3-0.6B-Base (student) using full-parameter fine-tuning.
## Model Details
- Base Model: Qwen/Qwen3-0.6B-Base
- Teacher Model: Qwen/Qwen3-8B-Base
- Method: Knowledge Distillation with Full Fine-tuning
- Dataset: MMLU (Massive Multitask Language Understanding)
- Distillation Alpha: 0.7 (see the loss sketch after this list)
- Temperature: 4.0
- Total Parameters: ~600M (all parameters updated)
- Format: Safetensors (safer and faster to load than the pickle-based PyTorch format; see Technical Notes)
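
The training code is not bundled with this card, so the following is only a minimal sketch of how an alpha-weighted, temperature-scaled distillation loss is typically combined with standard cross-entropy, consistent with the alpha = 0.7 and temperature = 4.0 above. The function name and the assumption that labels are already aligned with the logits are illustrative, not taken from the released code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=4.0):
    """Blend a soft-target (teacher) loss with hard-label cross-entropy."""
    # Soften both distributions with the temperature; the T**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard next-token cross-entropy on the ground-truth labels
    # (labels are assumed to already be shifted/aligned with the logits).
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # alpha weights the distillation term, (1 - alpha) the hard-label term.
    return alpha * kd_loss + (1 - alpha) * ce_loss
```

During training the teacher runs in inference mode; only the 0.6B student's parameters receive gradient updates (full fine-tuning, no adapters).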
## Training Details
- Training Samples: 285
- Epochs: 30
- Batch Size: 2
- Learning Rate: 5e-05
- Final Eval Loss: N/A
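
The exact training script is not published with this card. As a rough reproduction sketch, the hyperparameters above map onto Hugging Face `TrainingArguments` as follows; the output directory, logging, and saving settings are illustrative.

```python
from transformers import TrainingArguments

# Rough reproduction sketch only; the actual script for this checkpoint may differ.
training_args = TrainingArguments(
    output_dir="distilled-qwen3-0.6b-full-mmlu",  # illustrative
    num_train_epochs=30,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    logging_steps=10,        # illustrative
    save_strategy="epoch",   # illustrative
)
```

These arguments would typically be passed to a `Trainer` subclass whose `compute_loss` combines student and teacher outputs as in the loss sketch above.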
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the distilled model directly
tokenizer = AutoTokenizer.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")
model = AutoModelForCausalLM.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")

# Generate text
inputs = tokenizer("Question: What is the capital of France?\nA. London\nB. Berlin\nC. Paris\nD. Madrid\nAnswer:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Technical Notes
- Model weights are stored in the safetensors format, which avoids pickle deserialization and loads faster (see the loading example below)
- Compatible with any Hugging Face transformers version that supports safetensors
- Loading is memory-efficient via zero-copy memory mapping; once the weights are in memory, inference speed is the same as with the traditional PyTorch format
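
If you want to be explicit about the weight format, recent `transformers` versions let you require safetensors at load time. This is optional, since `from_pretrained` already prefers safetensors when it is available.

```python
from transformers import AutoModelForCausalLM

# Require the safetensors weights explicitly; this raises an error instead of
# silently falling back to a pickle-based checkpoint if none were available.
model = AutoModelForCausalLM.from_pretrained(
    "CarlOwOs/distilled-qwen3-0.6b-full-mmlu",
    use_safetensors=True,
)
```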
## Evaluation
This model should be evaluated on MCQA tasks using log-likelihood comparison, as implemented in the evaluation framework.
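
That evaluation framework is not included here; the sketch below shows the standard log-likelihood approach for MCQA, scoring each candidate answer by the total log-probability the model assigns to its tokens given the question. The `score_option` helper and the example question are illustrative, not part of the released evaluation code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

def score_option(model, tokenizer, prompt, option):
    """Sum the log-probabilities of the option tokens, conditioned on the prompt.

    Assumes tokenizing prompt + option leaves the prompt tokens unchanged,
    which holds for typical prompts but is an approximation in general.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i predicts token i + 1, so drop the last logit and shift targets.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict the answer tokens.
    return token_log_probs[:, prompt_len - 1:].sum().item()

tokenizer = AutoTokenizer.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")
model = AutoModelForCausalLM.from_pretrained("CarlOwOs/distilled-qwen3-0.6b-full-mmlu")

question = "Question: What is the capital of France?\nAnswer:"
options = [" London", " Berlin", " Paris", " Madrid"]
scores = [score_option(model, tokenizer, question, opt) for opt in options]
print(options[scores.index(max(scores))])  # expected: " Paris"
```

Length-normalizing the scores (dividing by the number of answer tokens) is a common variant when the options differ in length.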