Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation
A PyTorch implementation of Google DeepMind's Gemma3 270M model built entirely from scratch, featuring a compact transformer architecture.
Model Overview
This is a from-scratch implementation of the Gemma3 270M architecture that demonstrates modern transformer techniques, including sliding window attention, RoPE positional encoding, and mixed-precision training. The model follows the core architectural principles of the official Gemma3 270M while making practical choices for training efficiency.
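As a rough, illustrative sketch only (shapes, helper names, and the window size below are assumptions, not the repo's actual code in the architecture module), the two position-related techniques named above look roughly like this:

import torch

def apply_rope(x, base=10_000.0):
    # x: (batch, heads, seq_len, head_dim); head_dim must be even
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d/2,)
    angles = torch.outer(torch.arange(t).float(), inv_freq)         # (t, d/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)           # (t, d)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    rotated = torch.cat((-x2, x1), dim=-1)                          # "rotate half"
    return x * cos + rotated * sin                                  # position-dependent rotation

def sliding_window_causal_mask(seq_len, window=64):
    # True where attention is allowed: causal AND within the local window
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)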
Training Data
Dataset
- Source: TinyStories dataset (~600M tokens)
- Tokenizer: GPT-2 tokenizer (chosen over the Gemma3 270M tokenizer for faster data processing)
- Format: Memory-mapped binary files for efficient loading
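For illustration, tokenization into a memory-mapped binary and batch slicing might look like the sketch below; the file name, helper names, and uint16 choice are assumptions, not the repo's actual preprocessing script.

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def write_bin(texts, out_path="train.bin"):
    ids = []
    for text in texts:
        ids.extend(enc.encode_ordinary(text))
        ids.append(enc.eot_token)                        # separate documents
    arr = np.memmap(out_path, dtype=np.uint16, mode="w+", shape=(len(ids),))
    arr[:] = np.array(ids, dtype=np.uint16)              # GPT-2's 50257-token vocab fits in uint16
    arr.flush()

def get_batch(path, block_size=128, batch_size=32):
    data = np.memmap(path, dtype=np.uint16, mode="r")    # nothing is loaded until sliced
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)  # next-token targets
    return x, y  # convert to torch tensors and move to the GPU in the training loop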
Model Details
- This is the base model, trained solely on the TinyStories dataset for 10 hours on an A6000 GPU.
- Task: text-generation
- Language: en
- Dataset: https://huggingface.co/datasets/roneneldan/TinyStories
Training Procedure
Training Hyperparameters
- learning_rate: 1e-4
- max_iters: 150000
- warmup_steps: 1000
- min_lr: 5e-4
- eval_iters: 500
- batch_size: 32
- block_size: 128
- gradient_accumulation_steps: 32
- device: cuda
- dtype: bfloat16
- ptdtype: float32
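As a rough sketch of how these settings fit together in one optimizer step (gradient accumulation under bfloat16 autocast), with model, optimizer, and get_batch as placeholders rather than the repo's actual training loop:

import torch

batch_size = 32
block_size = 128
gradient_accumulation_steps = 32                          # effective batch = 32 x 32 sequences per step
autocast_ctx = torch.autocast(device_type="cuda", dtype=torch.bfloat16)

def train_step(model, optimizer, get_batch):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(gradient_accumulation_steps):
        x, y = get_batch(batch_size, block_size)          # (B, T) token ids already on the GPU
        with autocast_ctx:                                # forward pass in bfloat16
            _, loss = model(x, y)                         # assumes the model returns (logits, loss)
        (loss / gradient_accumulation_steps).backward()   # average gradients over micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()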
Evaluation results
Detailed training analysis and model evaluation can be found in results/results_interpertation.md, which includes:
- Loss Analysis: Training and validation loss curves showing smooth convergence without overfitting
- Qualitative Evaluation: Story generation examples demonstrating coherent narrative abilities
- Training Dynamics: Gradient norm analysis and learning rate schedule evaluation
- Model Performance: Final perplexity metrics and generation quality assessment
Key Results:
- Final train loss: 1.8 (perplexity ~6.0)
- Final validation loss: 2.0 (perplexity ~7.4); perplexity is just exp(loss), see the quick check after this list
- Excellent generalization with no overfitting observed
- Coherent story generation with proper grammar and age-appropriate content
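The perplexity figures above follow directly from the reported cross-entropy losses:

import math
print(math.exp(1.8))  # ~6.05 -> train perplexity ~6.0
print(math.exp(2.0))  # ~7.39 -> validation perplexity ~7.4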
Usage
Code Snippet
# Import necessary libraries
import torch
import tiktoken
from architecture import model_config, Gemma3Model

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Load the model
model_config["dtype"] = torch.bfloat16
model = Gemma3Model(model_config)  # re-create the model with the same config
device = "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device)))  # load the best checkpoint
model.to(device)
model.eval()

# Inference
sentence = "Dad was telling the kids an adventure tale about a pirate ship"
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim=0).to(device)
with torch.no_grad():
    y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))
Result
Dad was telling the kids an adventure tale about a pirate ship coming to the shore.
Suddenly, Dad showed John many pictures and showed him what to do. She chose a film for them to watch.
John was excited. He had never seen one before and was intrigued.
When they arrived, Dad handed John bookshelf safely. "What have you got, John?", asked Dad. John eagerly answered back to Dad. Dad explained that the businessman was a dinosaur that had been guarded by the sea.
John thought about this for a reason and knew he was too happy with this movie. He said to Dad, "Life is a really fun experience".
His Dad nodded and said, "Yes, you can accept anything special. It was a very comfortable motorcycle."Once upon a time, there was a nice friendly little boy named John. Every day he would have endless their conversation and encouragement. He was so full of joy and excitement taking action.
Today, John was playing in the backyard when
Limitations and Biases
- This model is intended only for understanding the architecture of a transformer-based model built from scratch and for building intuition
- Inference is very slow because there is no KV cache (see the sketch after this list)
- TinyStories is synthetic data generated by GPT-3.5/4
- May have inherited biases or patterns from the generating model
- Limited diversity compared to real human-written content
- Repetitive narrative structures typical of children's literature
- 270M parameters is relatively small by modern standards
- Limited reasoning capabilities compared to larger models
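To make the missing-KV-cache point concrete, cache-free decoding re-runs the full forward pass over the entire prefix for every new token (greedy decoding shown for brevity; the model call signature is an assumption and the repo's actual generate method may sample differently):

import torch

@torch.no_grad()
def naive_generate(model, idx, max_new_tokens, block_size=128):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                   # crop to the context window
        logits = model(idx_cond)                          # (B, T, vocab): full recompute every step
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)            # append and repeat
    return idx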
Training Infrastructure
For a complete guide covering the entire process - from data tokenization to inference - please refer to the GitHub repository.
Last Update
2025-09-06
Citation
@misc{gemma3-270m-pytorch,
  title={Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation},
  author={Doula Isham Rashik Hasan},
  year={2025},
  howpublished={\url{https://github.com/di37/gemma3-270M-tinystories-pytorch}},
  note={Implementation of Google DeepMind's Gemma3 270M from scratch pre-trained on TinyStories}
}