# Qwen3-30M with GPT-2 Tokenizer (FP16)
A ~30M-parameter version of Qwen3-0.6B that uses the GPT-2 tokenizer for better compatibility and is stored in FP16 precision for memory efficiency.
## Model Details
- Base Model: Qwen/Qwen3-0.6B
- Architecture: Qwen3 (8 layers, 224 hidden size)
- Parameters: ~35M (reduced from 637M)
- Tokenizer: GPT-2 (50,257 vocabulary)
- Vocabulary: Reduced from 151,936 to 50,257 tokens
- Precision: FP16 (half precision for memory efficiency)
- Model Size: ~60MB (vs ~120MB in FP32)
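The values listed above can be checked without downloading the full weights; a minimal sketch, assuming the repository's `config.json` uses the standard Qwen3 field names in `transformers`:

```python
from transformers import AutoConfig

# Inspect the published config and compare against the figures above.
config = AutoConfig.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
print(config.num_hidden_layers)  # expected: 8
print(config.hidden_size)        # expected: 224
print(config.vocab_size)         # expected: 50257
print(config.torch_dtype)        # expected: float16
```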
## Architecture Specifications
- Layers: 8 transformer layers
- Hidden Size: 224
- Intermediate Size: 896 (4x hidden_size)
- Attention Heads: 8
- Key-Value Heads: 8
- Max Position Embeddings: 32,768
- Activation: SiLU
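As a sanity check on the sizes above, the total parameter count can be computed from the loaded model; a rough sketch (the exact figure depends on whether the embedding and LM-head weights are tied in this checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Mostafa8Mehrabi/qwen3-30m-fp16", torch_dtype=torch.float16
)
# Sum the element counts of every weight tensor in the model.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # should land in the ~30-35M range quoted above
```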
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load with fp16 weights
tokenizer = AutoTokenizer.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
model = AutoModelForCausalLM.from_pretrained(
    "Mostafa8Mehrabi/qwen3-30m-fp16",
    torch_dtype=torch.float16,  # explicitly use fp16
    device_map="auto",          # automatically place on an available device
)

# For GPU inference (recommended)
# model = model.to("cuda")  # if you have a GPU

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
# Move inputs to the same device as the model if using a GPU
# inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
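For quick experiments, the same settings also work through the `text-generation` pipeline, which bundles tokenization, device placement, and decoding; a minimal sketch:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Mostafa8Mehrabi/qwen3-30m-fp16",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Same sampling settings as the example above.
result = pipe("Hello, how are you?", max_length=50, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```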
## Key Features
- ✅ FP16 Precision: ~50% smaller model size, faster inference
- ✅ 30M Parameters: Ultra-lightweight for edge deployment
- ✅ 8 Layers: Balanced depth for performance
- ✅ Standard GPT-2 tokenizer (no `trust_remote_code` needed)
- ✅ Compatible vocabulary sizes (see the sketch after this list)
- ✅ SafeTensors format for faster loading
- ✅ Works like any Hugging Face model
- ✅ ~21x smaller than the original Qwen3-0.6B
- ✅ GPU optimized for efficient inference
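A quick way to confirm the vocabulary compatibility claimed above is to compare the tokenizer size with the model's input embedding matrix; a minimal sketch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
model = AutoModelForCausalLM.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")

# Both numbers should match the 50,257-token GPT-2 vocabulary.
print(len(tokenizer))                               # expected: 50257
print(model.get_input_embeddings().num_embeddings)  # expected: 50257
```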
## Architecture Comparison
| Component | Original | This Model |
|---|---|---|
| Parameters | 637M | ~35M |
| Vocabulary | 151,936 | 50,257 |
| Hidden Size | 1024 | 224 |
| Layers | 28 | 8 |
| Intermediate Size | 4096 | 896 |
| Attention Heads | 16 | 8 |
| Tokenizer | Qwen3 | GPT-2 |
| Precision | FP32 | FP16 |
| Model Size | ~1.2GB | ~60MB |
## Memory Requirements
- FP16: ~60MB model + ~30MB working memory = **~90MB total**
- FP32: ~120MB model + ~60MB working memory = **~180MB total**
- Memory savings: ~50% reduction compared to FP32
- Ultra-lightweight: Perfect for mobile and edge devices
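The FP16 figure above can be measured directly on a loaded model; a minimal sketch using `transformers`' built-in `get_memory_footprint()`:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Mostafa8Mehrabi/qwen3-30m-fp16", torch_dtype=torch.float16
)
# Reports the memory taken by the model's parameters and buffers.
print(f"{model.get_memory_footprint() / 1e6:.0f} MB")  # should be close to the ~60MB figure above
```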
## Performance Notes
- FP16 provides significant memory savings with minimal quality loss
- 30M parameters optimized for fast inference while maintaining coherence
- Ideal for deployment in resource-constrained environments
- Compatible with both CPU and GPU inference
- Faster loading times due to smaller file size
- 8 layers provide good balance between model capacity and speed
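To put numbers on inference speed for your own hardware, a rough CPU timing sketch (illustrative only; results vary by machine, and `from_pretrained` upcasts the weights to FP32 by default, which is usually faster than FP16 on CPU):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")
model = AutoModelForCausalLM.from_pretrained("Mostafa8Mehrabi/qwen3-30m-fp16")  # FP32 on CPU
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32, do_sample=False)
elapsed = time.perf_counter() - start
print(f"32 new tokens in {elapsed:.2f}s ({32 / elapsed:.1f} tok/s)")
```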