Qwen3-0.6B Pre-trained on TinyStories

This is a Qwen3-0.6B model pre-trained on the TinyStories dataset for 200k iterations.

Model Details

  • Architecture: Qwen3-0.6B
  • Training Data: TinyStories dataset from HuggingFace
  • Training Iterations: 200,000
  • Parameters: ~596M unique parameters
  • Tokenizer: GPT-2 tokenizer (tiktoken)
  • Training Loss: Available in training history

Quick Start

Download the Model

from huggingface_hub import hf_hub_download
import torch

# Download model weights
model_path = hf_hub_download(
    repo_id="vuminhtue/qwen3-200k-tinystories",
    filename="Qwen3_200k_model_params.pt"
)

# Download config
config_path = hf_hub_download(
    repo_id="vuminhtue/qwen3-200k-tinystories",
    filename="config.json"
)
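
If you prefer not to hard-code the hyperparameters (as done in the next step), the downloaded config.json can be inspected directly. A minimal sketch; the exact keys it contains depend on how the config was exported:

import json

# Read the saved configuration; key names may differ from the
# hard-coded dict used in "Load and Use" below
with open(config_path) as f:
    saved_config = json.load(f)
print(saved_config)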

Load and Use

import torch
import tiktoken
from Qwen3_model import Qwen3Model  # model class definition; copy this file from the original training code

# Set up configuration
QWEN3_CONFIG = {
    "vocab_size": 151936,
    "context_length": 40960,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 28,
    "hidden_dim": 3072,
    "head_dim": 128,
    "qk_norm": True,
    "n_kv_groups": 8,
    "rope_base": 1000000.0,
    "dtype": torch.bfloat16,
}

# Load model
model = Qwen3Model(QWEN3_CONFIG)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.load_state_dict(torch.load(model_path, map_location=device))
model = model.to(device)
model.eval()

# Generate text
tokenizer = tiktoken.get_encoding("gpt2")
# Your generation code here...
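
Below is a minimal greedy-decoding sketch to fill in the generation step. The prompt, the 100 new tokens, and the 128-token window are illustrative choices (the model was trained with a 128-token context), and it assumes the model's forward pass returns logits of shape (batch, seq_len, vocab_size):

# Illustrative greedy generation loop (assumes model(input_ids) returns
# logits of shape [batch, seq_len, vocab_size])
prompt = "Once upon a time"
token_ids = torch.tensor([tokenizer.encode(prompt)], device=device)

with torch.no_grad():
    for _ in range(100):
        logits = model(token_ids[:, -128:])  # stay within the 128-token training context
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)

print(tokenizer.decode(token_ids[0].tolist()))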

Training Details

  • Optimizer: AdamW with weight decay (0.1)
  • Learning Rate: 1e-4 with warmup and cosine decay
  • Batch Size: 32 with gradient accumulation (32 steps)
  • Context Length: 128 tokens
  • Mixed Precision: bfloat16 training
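
A sketch of an AdamW optimizer with a warmup-plus-cosine schedule matching the settings above; warmup_steps is an assumed placeholder (the original warmup length is not listed here), and the actual training script may have implemented the schedule differently:

import math
import torch

# AdamW with weight decay 0.1 and peak LR 1e-4, as listed above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

warmup_steps, max_steps = 2_000, 200_000  # warmup_steps is an assumed placeholder

def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay toward zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)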

Model Architecture

  • Grouped Query Attention (GQA) with 8 KV groups
  • RoPE (Rotary Position Embeddings)
  • RMSNorm for normalization
  • SiLU activation function
  • 28 transformer layers
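
As a rough sanity check, the ~596M figure in Model Details can be reproduced from QWEN3_CONFIG, assuming tied input/output embeddings and ignoring the small RMSNorm weights:

# Back-of-the-envelope parameter count from QWEN3_CONFIG
V, d, n_layers = 151936, 1024, 28
n_heads, n_kv_groups, head_dim, hidden_dim = 16, 8, 128, 3072

embed = V * d                                  # ~155.6M, shared with the output head if tied
attn = 2 * d * n_heads * head_dim \
     + 2 * d * n_kv_groups * head_dim          # q/o projections + k/v projections
mlp = 3 * d * hidden_dim                       # gate, up, and down projections
total = embed + n_layers * (attn + mlp)
print(f"~{total / 1e6:.0f}M parameters")       # ~596M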

Performance

The model was trained on TinyStories, a dataset of simple stories for children. It can generate coherent short stories in a similar style.

Citation

If you use this model, please cite:

@misc{qwen3-tinystories-2025,
  author = {Tue Vu},
  title = {Qwen3-0.6B Pre-trained on TinyStories},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/vuminhtue/qwen3-200k-tinystories}},
}

License

MIT License

Contact

For questions or issues, please open an issue on the HuggingFace model page.
