Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation
A PyTorch implementation of Google DeepMind's Gemma3 270M model built entirely from scratch, featuring a compact transformer architecture.
Model Overview
This is a from-scratch implementation of the Gemma3 270M architecture that demonstrates modern transformer techniques, including sliding window attention, RoPE positional encoding, and mixed-precision training. The model follows the core architectural principles of the official Gemma3 270M while making practical choices for training efficiency.
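As a rough, illustrative sketch only (shapes, helper names, and the window size below are assumptions, not the repo's actual code in the architecture module), the two position-related techniques named above look roughly like this:

import torch

def apply_rope(x, base=10_000.0):
    # x: (batch, heads, seq_len, head_dim); head_dim must be even
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d/2,)
    angles = torch.outer(torch.arange(t).float(), inv_freq)         # (t, d/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)           # (t, d)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    rotated = torch.cat((-x2, x1), dim=-1)                          # "rotate half"
    return x * cos + rotated * sin                                  # position-dependent rotation

def sliding_window_causal_mask(seq_len, window=64):
    # True where attention is allowed: causal AND within the local window
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)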
Training Data
Dataset
- Source: TinyStories dataset (~600M tokens)
- Tokenizer: GPT-2 tokenizer (chosen over the Gemma3 270M tokenizer for faster data processing)
- Format: Memory-mapped binary files for efficient loading
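For illustration, tokenization into a memory-mapped binary and batch slicing might look like the sketch below; the file name, helper names, and uint16 choice are assumptions, not the repo's actual preprocessing script.

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def write_bin(texts, out_path="train.bin"):
    ids = []
    for text in texts:
        ids.extend(enc.encode_ordinary(text))
        ids.append(enc.eot_token)                        # separate documents
    arr = np.memmap(out_path, dtype=np.uint16, mode="w+", shape=(len(ids),))
    arr[:] = np.array(ids, dtype=np.uint16)              # GPT-2's 50257-token vocab fits in uint16
    arr.flush()

def get_batch(path, block_size=128, batch_size=32):
    data = np.memmap(path, dtype=np.uint16, mode="r")    # nothing is loaded until sliced
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)  # next-token targets
    return x, y  # convert to torch tensors and move to the GPU in the training loop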
Model Details
- This is the base model, trained solely on the TinyStories dataset for 10 hours on an A6000 GPU.
- Task: text-generation
- Language: en
- Dataset: https://huggingface.co/datasets/roneneldan/TinyStories
Training Procedure
Training Hyperparameters
- learning_rate: 1e-4
- max_iters: 150000
- warmup_steps: 1000
- min_lr: 5e-4
- eval_iters: 500
- batch_size: 32
- block_size: 128
- gradient_accumulation_steps: 32
- device: cuda
- dtype: bfloat16
- ptdtype: float32
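As a rough sketch of how these settings fit together in one optimizer step (gradient accumulation under bfloat16 autocast), with model, optimizer, and get_batch as placeholders rather than the repo's actual training loop:

import torch

batch_size = 32
block_size = 128
gradient_accumulation_steps = 32                          # effective batch = 32 x 32 sequences per step
autocast_ctx = torch.autocast(device_type="cuda", dtype=torch.bfloat16)

def train_step(model, optimizer, get_batch):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(gradient_accumulation_steps):
        x, y = get_batch(batch_size, block_size)          # (B, T) token ids already on the GPU
        with autocast_ctx:                                # forward pass in bfloat16
            _, loss = model(x, y)                         # assumes the model returns (logits, loss)
        (loss / gradient_accumulation_steps).backward()   # average gradients over micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()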
Evaluation results
Detailed training analysis and model evaluation can be found in results/results_interpertation.md, which includes:
- Loss Analysis: Training and validation loss curves showing smooth convergence without overfitting
- Qualitative Evaluation: Story generation examples demonstrating coherent narrative abilities
- Training Dynamics: Gradient norm analysis and learning rate schedule evaluation
- Model Performance: Final perplexity metrics and generation quality assessment
Key Results:
- Final train loss: 1.8 (perplexity ~6.0)
- Final validation loss: 2.0 (perplexity ~7.4); perplexity is just exp(loss), see the quick check after this list
- Excellent generalization with no overfitting observed
- Coherent story generation with proper grammar and age-appropriate content
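The perplexity figures above follow directly from the reported cross-entropy losses:

import math
print(math.exp(1.8))  # ~6.05 -> train perplexity ~6.0
print(math.exp(2.0))  # ~7.39 -> validation perplexity ~7.4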
Usage
Code Snippet
# Import necessary libraries
import torch
import tiktoken
from architecture import model_config, Gemma3Model

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Load the model
model_config["dtype"] = torch.bfloat16
model = Gemma3Model(model_config)  # re-create the model with the same config
device = "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device)))  # load the best checkpoint
model.to(device)
model.eval()

# Inference
sentence = "Dad was telling the kids an adventure tale about a pirate ship"
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim=0).to(device)
with torch.no_grad():
    y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))
Result
Dad was telling the kids an adventure tale about a pirate ship coming to the shore.
Suddenly, Dad showed John many pictures and showed him what to do. She chose a film for them to watch.
John was excited. He had never seen one before and was intrigued.
When they arrived, Dad handed John bookshelf safely. "What have you got, John?", asked Dad. John eagerly answered back to Dad. Dad explained that the businessman was a dinosaur that had been guarded by the sea.
John thought about this for a reason and knew he was too happy with this movie. He said to Dad, "Life is a really fun experience".
His Dad nodded and said, "Yes, you can accept anything special. It was a very comfortable motorcycle."Once upon a time, there was a nice friendly little boy named John. Every day he would have endless their conversation and encouragement. He was so full of joy and excitement taking action.
Today, John was playing in the backyard when
Limitations and Biases
- This model is intended only for understanding the architecture of a transformer-based model built from scratch and for building intuition
- Inference is very slow because there is no KV cache (see the sketch after this list)
- TinyStories is synthetic data generated by GPT-3.5/4
- May have inherited biases or patterns from the generating model
- Limited diversity compared to real human-written content
- Repetitive narrative structures typical of children's literature
- 270M parameters is relatively small by modern standards
- Limited reasoning capabilities compared to larger models
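To make the missing-KV-cache point concrete, cache-free decoding re-runs the full forward pass over the entire prefix for every new token (greedy decoding shown for brevity; the model call signature is an assumption and the repo's actual generate method may sample differently):

import torch

@torch.no_grad()
def naive_generate(model, idx, max_new_tokens, block_size=128):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                   # crop to the context window
        logits = model(idx_cond)                          # (B, T, vocab): full recompute every step
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)            # append and repeat
    return idx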
Training Infrastructure
For a complete guide covering the entire process - from data tokenization to inference - please refer to the GitHub repository.
Last Update
2025-09-06
Citation
@misc{gemma3-270m-pytorch,
  title={Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation},
  author={Doula Isham Rashik Hasan},
  year={2025},
  howpublished={\url{https://github.com/di37/gemma3-270M-tinystories-pytorch}},
  note={Implementation of Google DeepMind's Gemma3 270M from scratch pre-trained on TinyStories}
}