Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation

A PyTorch implementation of Google DeepMind's Gemma3 270M model built entirely from scratch, featuring a compact transformer architecture.

Model Overview

This is a from-scratch implementation of the Gemma3 270M architecture that demonstrates modern transformer techniques, including sliding window attention, RoPE positional encoding, and mixed precision training. The model preserves the core architectural principles of the official Gemma3 270M while making practical choices for training efficiency.
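As an illustration of one of these components, below is a minimal, self-contained sketch of rotary position embeddings (RoPE) applied to a query or key tensor. The function and tensor names are illustrative and not taken from architecture.py; the actual implementation in this repository may pair channels differently.

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, num_heads, seq_len, head_dim); head_dim must be even.
    batch, heads, seq_len, head_dim = x.shape
    # One rotation frequency per pair of channels.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]            # split channels into pairs
    # Rotate each pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out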

Training Data

Dataset

  • Source: TinyStories dataset (~600M tokens)
  • Tokenizer: GPT-2 tokenizer, chosen for faster data processing than the Gemma3 270M tokenizer
  • Format: Memory-mapped binary files for efficient loading (see the preprocessing sketch after this list)
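A minimal sketch of how this preprocessing might look: tokenize each story with the GPT-2 tokenizer and write the token ids into a memory-mapped uint16 file. The function name, output path, and handling of document boundaries are assumptions, not the repository's actual script.

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def write_bin(texts, path):
    # Tokenize every story and separate documents with the end-of-text token.
    ids = []
    for text in texts:
        ids.extend(enc.encode_ordinary(text))
        ids.append(enc.eot_token)
    # The GPT-2 vocabulary (50257 tokens) fits in uint16, so a compact memmap suffices.
    arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(len(ids),))
    arr[:] = np.array(ids, dtype=np.uint16)
    arr.flush()

# write_bin(train_stories, "train.bin")  # hypothetical usage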

Model Details

Training Procedure

Training Hyperparameters

  • learning_rate: 1e-4
  • max_iters: 150000
  • warmup_steps: 1000
  • min_lr: 5e-4
  • eval_iters: 500
  • batch_size: 32
  • block_size: 128
  • gradient_accumulation_steps: 32 (see the training-step sketch after this list)
  • device: cuda
  • dtype: bfloat16
  • ptdtype: float32
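A minimal sketch of how these settings might fit together in a training step: linear warmup followed by decay toward min_lr, gradient accumulation over micro-batches, and bfloat16 autocast. The schedule shape, the helper names, and the assumption that the model returns (logits, loss) are illustrative; the repository's actual loop may differ.

import math
import torch

def get_lr(it, learning_rate, min_lr, warmup_steps, max_iters):
    # Linear warmup, then cosine decay toward min_lr (a common choice; an assumption here).
    if it < warmup_steps:
        return learning_rate * (it + 1) / warmup_steps
    progress = (it - warmup_steps) / max(1, max_iters - warmup_steps)
    return min_lr + 0.5 * (learning_rate - min_lr) * (1 + math.cos(math.pi * progress))

def train_step(model, optimizer, get_batch, it, cfg):
    # One optimizer update accumulated over cfg["gradient_accumulation_steps"] micro-batches.
    for group in optimizer.param_groups:
        group["lr"] = get_lr(it, cfg["learning_rate"], cfg["min_lr"],
                             cfg["warmup_steps"], cfg["max_iters"])
    for _ in range(cfg["gradient_accumulation_steps"]):
        xb, yb = get_batch("train")  # hypothetical loader returning (input, target) token batches
        with torch.autocast(device_type=cfg["device"], dtype=torch.bfloat16):
            _, loss = model(xb, yb)  # assumes the model returns (logits, loss)
        (loss / cfg["gradient_accumulation_steps"]).backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)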

Evaluation results

Detailed training analysis and model evaluation can be found in results/results_interpertation.md, which includes:

  • πŸ“Š Loss Analysis: Training and validation loss curves showing smooth convergence without overfitting
  • πŸ“ Qualitative Evaluation: Story generation examples demonstrating coherent narrative abilities
  • πŸ“ˆ Training Dynamics: Gradient norm analysis and learning rate schedule evaluation
  • 🎯 Model Performance: Final perplexity metrics and generation quality assessment

Key Results:

  • Final train loss: 1.8 (perplexity ~6.0)
  • Final validation loss: 2.0 (perplexity ~7.4; see the note after this list)
  • Excellent generalization with no overfitting observed
  • Coherent story generation with proper grammar and age-appropriate content
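The perplexity values above follow directly from the losses, since perplexity is the exponential of the cross-entropy loss:

import math

print(math.exp(1.8))  # ≈ 6.05, the reported train perplexity
print(math.exp(2.0))  # ≈ 7.39, the reported validation perplexity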

Usage

Code Snippet

# Import necessary libraries
import torch
import tiktoken
from architecture import model_config, Gemma3Model

# Tokenizer (GPT-2 BPE, matching the training data)
enc = tiktoken.get_encoding("gpt2")

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model_config["dtype"] = torch.bfloat16
model = Gemma3Model(model_config)  # re-create the model with the same config
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device)))  # load best checkpoint
model.to(device)
model.eval()

# Inference
sentence = "Dad was telling the kids an adventure tale about a pirate ship"
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim=0).to(device)
y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))

Result

Dad was telling the kids an adventure tale about a pirate ship coming to the shore. 

Suddenly, Dad showed John many pictures and showed him what to do. She chose a film for them to watch. 
John was excited. He had never seen one before and was intrigued.

When they arrived, Dad handed John bookshelf safely. "What have you got, John?", asked Dad. John eagerly answered back to Dad. Dad explained that the businessman was a dinosaur that had been guarded by the sea. 

John thought about this for a reason and knew he was too happy with this movie. He said to Dad, "Life is a really fun experience". 
His Dad nodded and said, "Yes, you can accept anything special. It was a very comfortable motorcycle."Once upon a time, there was a nice friendly little boy named John. Every day he would have endless their conversation and encouragement. He was so full of joy and excitement taking action.

Today, John was playing in the backyard when

Limitations and Biases

  • This model is intended only for understanding the architecture of a transformer-based model built from scratch and for building intuition
  • Inference is slow because no KV cache is implemented (see the sketch after this list)
  • TinyStories is synthetic data generated by GPT-3.5/4
  • May have inherited biases or patterns from the generating model
  • Limited diversity compared to real human-written content
  • Repetitive narrative structures typical of children's literature
  • 270M parameters is relatively small by modern standards
  • Limited reasoning capabilities compared to larger models
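To illustrate why the missing KV cache hurts inference speed: without a cache, every generation step re-runs the entire prefix through the model, so the cost of step t grows with t. The sketch below is generic (it assumes the model returns (logits, loss) and uses greedy decoding); it is not the repository's generate method.

import torch

@torch.no_grad()
def generate_naive(model, ids, max_new_tokens):
    # No KV cache: attention over the entire prefix is recomputed at every step.
    for _ in range(max_new_tokens):
        logits, _ = model(ids)                            # full forward pass over all tokens
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)            # prefix keeps growing
    return ids

# With a KV cache, only the newest token is fed forward each step and the stored
# keys/values of the prefix are reused, making the per-step cost roughly constant.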

Training Infrastructure

For a complete guide covering the entire process - from data tokenization to inference - please refer to the GitHub repository.

Last Update

2025-09-06

Citation

@misc{gemma3-270m-pytorch,
  title={Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation},
  author={Doula Isham Rashik Hasan},
  year={2025},
  howpublished={\url{https://github.com/di37/gemma3-270M-tinystories-pytorch}},
  note={Implementation of Google DeepMind's Gemma3 270M from scratch pre-trained on TinyStories}
}