From Zero to AI: Build Your First Language Model in 5 Minutes with Google's Gemma
Ever looked at AI like ChatGPT and thought it was pure magic? What if you could build your very own, tiny language model from scratch, right now, and prove it’s not magic—it's technology you can understand?
In this guide, we will do exactly that. We'll take a "brain" from Google's powerful Gemma model (its tokenizer) and attach it to a brand-new, "baby" model that we'll create ourselves. Then, we'll teach it just two sentences and watch it learn to speak. By the end, you'll have trained your first model and seen it work with your own eyes.
This is the ultimate "version 0.1" project, designed to take you from zero to your first real success in AI.
Prerequisites
- Python and Pip: You should have Python installed on your system.
- A Local Gemma Model Folder: You need the folder for gemma-3-1b-it-qat-q4_0-unquantized. We only need its configuration and tokenizer files, not the massive model weights.
- Hugging Face Transformers: The magic wand for our project. Install it in your terminal:
pip install transformers torch accelerate
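If you'd like to confirm the installation before moving on, this quick sanity check simply imports the libraries and prints their versions; it should run without errors:
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"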
The Blueprint: Our "Frankenstein" AI
Our strategy is simple but brilliant:
- Borrow the Brain (The Tokenizer): A big model like Gemma has spent thousands of hours learning how to read text and break it down into numbers (tokens). We'll borrow this pre-trained knowledge so our model doesn't have to learn English from scratch (there's a short tokenizer example right after this list).
- Build the Body (The Model): We will define a new, incredibly small model architecture. It's like an empty shell, with its weights initialized to random nonsense. It knows nothing.
- The Lesson (The Training): We'll show our new model two sentences over and over again until it starts to recognize the patterns.
- The Test (The Result): We'll give it the beginning of a sentence and see if it can complete it correctly.
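To make the "borrow the brain" step concrete, here is a minimal sketch of the tokenizer working on its own. It assumes the same local gemma-3-1b-it-qat-q4_0-unquantized folder that the main script uses:
from transformers import AutoTokenizer

# Load only the tokenizer from the local Gemma folder (no model weights are touched).
tokenizer = AutoTokenizer.from_pretrained("./gemma-3-1b-it-qat-q4_0-unquantized")

# Text goes in, a list of token IDs (integers) comes out...
ids = tokenizer("The first sentence is about machine learning.")["input_ids"]
print(ids)

# ...and the same IDs can be decoded back into the original text.
print(tokenizer.decode(ids, skip_special_tokens=True))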
The Code: Your Complete Training Script
Create a file named train_tiny_model.py and paste the entire code block below into it. The code is heavily documented with docstrings and inline comments to explain every part.
# -*- coding: utf-8 -*-
"""A complete script to train a tiny Gemma model from scratch.
This script demonstrates the full pipeline of:
1. Loading a pre-trained tokenizer from a local model directory.
2. Preparing a very small, custom dataset in English.
3. Defining a new, miniature Gemma model architecture with random weights.
4. Training the new model on the custom dataset using the Hugging Face Trainer.
5. Saving the final trained model and testing its text generation capabilities.
"""
import torch
from transformers import (
AutoConfig,
AutoTokenizer,
GemmaConfig,
GemmaForCausalLM,
Trainer,
TrainingArguments,
pipeline,
)
# ==============================================================================
# STEP 1: CONFIGURE AND LOAD THE PRE-TRAINED TOKENIZER
# ==============================================================================
# This path points to your local, unquantized model directory.
# The tokenizer from this model will be used, but not the model weights.
local_model_path = "./gemma-3-1b-it-qat-q4_0-unquantized"
print(f"Step 1: Loading tokenizer from local path '{local_model_path}'...")
try:
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
print("Tokenizer loaded successfully!")
except Exception as e:
print(f"Error: Could not load tokenizer. Ensure the path is correct: {e}")
exit()
# Not every tokenizer ships with a pad_token. If this one doesn't, we'll fall
# back to the end-of-sequence token, which is a standard practice.
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
print("Set model's pad_token to be the eos_token.")
# ==============================================================================
# STEP 2: PREPARE THE CUSTOM DATASET
# ==============================================================================
print("\nStep 2: Preparing the custom dataset...")
# Our entire training dataset consists of just two English sentences.
sentences = [
"The first sentence is about machine learning.",
"The second sentence is about natural language processing.",
]
# Tokenize the sentences, converting text into numerical IDs that the model
# can understand.
inputs = tokenizer(
sentences,
padding=True, # Pad sentences to the same length within the batch.
truncation=True, # Truncate sentences if they are too long.
return_tensors="pt" # Return PyTorch tensors.
)
print("Dataset tokenized. Input tensor shape:", inputs['input_ids'].shape)
# ==============================================================================
# STEP 3: DEFINE A NEW, TINY MODEL ARCHITECTURE
# ==============================================================================
print("\nStep 3: Defining a new, tiny model architecture...")
# First, load the original configuration to get essential parameters
# like vocab_size, which MUST match the tokenizer.
base_config = AutoConfig.from_pretrained(local_model_path)
# Now, define the configuration for our new, very small model.
# This is the "from scratch" part, as we are defining a new structure
# without loading any pre-trained weights.
small_config = GemmaConfig(
hidden_size=128, # Drastically reduced hidden layer size.
intermediate_size=512, # Drastically reduced feed-forward layer size.
num_hidden_layers=2, # Only 2 layers instead of many more.
num_attention_heads=4, # Number of query heads.
num_key_value_heads=4, # Number of key/value heads (must be present for Gemma).
max_position_embeddings=1024, # Maximum sequence length the model can handle.
vocab_size=base_config.vocab_size, # CRITICAL: Must match the tokenizer.
pad_token_id=tokenizer.pad_token_id,
)
# Instantiate a new model from our tiny configuration.
# Its weights will be randomly initialized.
small_model = GemmaForCausalLM(small_config)
print("Tiny model created successfully!")
print(f"Model parameter count: {small_model.num_parameters():,}")
# ==============================================================================
# STEP 4: CONFIGURE AND RUN THE TRAINING
# ==============================================================================
print("\nStep 4: Configuring and starting the training...")
class SimpleDataset(torch.utils.data.Dataset):
"""A simple Dataset class compatible with the Hugging Face Trainer.
This class wraps the tokenized inputs dictionary and makes it accessible
by index. For causal language modeling, it sets the 'labels' to be the
same as the 'input_ids'.
Attributes:
encodings (dict): A dictionary from the tokenizer containing tensors
like 'input_ids' and 'attention_mask'.
"""
def __init__(self, encodings):
"""Initializes the SimpleDataset.
Args:
encodings (dict): The tokenized inputs from a Hugging Face tokenizer.
"""
self.encodings = encodings
def __getitem__(self, idx):
"""Retrieves an item (a dictionary of tensors) by index."""
# For language modeling, the model learns to predict the next token,
# so the `labels` are typically the `input_ids` themselves.
item = {key: val[idx].clone() for key, val in self.encodings.items()}
item['labels'] = item['input_ids']
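        # Note: for simplicity the labels also include padding positions; a more
        # careful setup would replace those label values with -100 so the loss
        # ignores padding tokens.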
return item
def __len__(self):
"""Returns the total number of samples in the dataset."""
return len(self.encodings['input_ids'])
train_dataset = SimpleDataset(inputs)
# Define the arguments that control the training process.
training_args = TrainingArguments(
output_dir="./gemma_tiny_model_output_en", # Directory for checkpoints and logs.
num_train_epochs=100, # Train for 100 epochs to ensure the model memorizes the data.
per_device_train_batch_size=1, # Process one sentence at a time.
logging_steps=10, # Log training loss every 10 steps.
save_strategy="no", # Do not save checkpoints during training.
report_to="none", # Disable integrations like W&B.
)
# Initialize the Trainer, which handles the entire training loop.
trainer = Trainer(
model=small_model,
args=training_args,
train_dataset=train_dataset,
)
trainer.train()
print("Training complete!")
# ==============================================================================
# STEP 5: SAVE THE FINAL MODEL AND TEST IT
# ==============================================================================
print("\nStep 5: Saving the final model and testing...")
# Define the path to save the final, trained model and tokenizer.
final_model_path = "./my_first_tiny_gemma_en"
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)
print(f"Model saved to: {final_model_path}")
print("\n--- Testing Generation ---")
# The prompt should match the beginning of one of our training sentences.
prompt = "The first sentence is"
# Create a text-generation pipeline with our newly trained model.
generator = pipeline(
'text-generation',
model=final_model_path,
tokenizer=final_model_path,
device=0 if torch.cuda.is_available() else -1 # Use GPU if available.
)
# Generate text. The model should overfit and complete the sentence it saw.
outputs = generator(prompt, max_new_tokens=15)
print(f"\nInput Prompt: '{prompt}'")
print(f"Model Generation: '{outputs[0]['generated_text']}'")
print("\n--- Experiment Successful ---")
print("You have successfully trained and tested a miniature language model from scratch!")
The Moment of Truth: Running the Script
Now, open your terminal, navigate to the directory where you saved train_tiny_model.py, and run it:
python train_tiny_model.py
Understanding Your Success: The Output
You will see a lot of text, but let's focus on the most important parts. Your output will look something like this:
Step 4: Configuring and starting the training...
{'loss': 10.4814, ...}
{'loss': 9.3495, ...}
...
{'loss': 5.8222, ...}
{'train_runtime': 3.6616, ... 'train_loss': 6.8524, 'epoch': 100.0}
Training complete!
Step 5: Saving the final model and testing...
Model saved to: ./my_first_tiny_gemma_en
--- Testing Generation ---
Input Prompt: 'The first sentence is'
Model Generation: 'The first sentence is about machine learning............'
--- Experiment Successful ---
You have successfully trained and tested a miniature language model from scratch!
Let's break down why this is a huge success:
- The Loss Went Down: Look at the {'loss': ...} lines. The number started high (around 10.4) and ended low (around 5.8). Loss is a measure of the model's error, so watching it fall is direct evidence that your model was learning.
- It Completed Your Sentence! This is the magic moment.
  - You prompted it with: 'The first sentence is'
  - It generated: 'The first sentence is about machine learning............'
  - It worked! It perfectly recalled the sentence it was taught. The random, nonsense model you created just minutes ago has learned to associate your prompt with the correct completion. The trailing dots (.) are just the model "running out of things to say" because its knowledge is so limited, which is exactly what we'd expect.
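If you'd like to see for yourself what loss measures, here is a tiny sketch, independent of the training script, of cross-entropy (the loss used for causal language modeling). It is small when the model puts high probability on the correct next token and large when it doesn't:
import torch
import torch.nn.functional as F

# Toy vocabulary of 5 tokens; the "correct" next token is index 3.
target = torch.tensor([3])

# A model that is confident and right: most probability mass on token 3.
good_logits = torch.tensor([[0.1, 0.1, 0.1, 5.0, 0.1]])

# A model that is confident and wrong: most probability mass on token 0.
bad_logits = torch.tensor([[5.0, 0.1, 0.1, 0.1, 0.1]])

print(F.cross_entropy(good_logits, target))  # small loss (roughly 0.03)
print(F.cross_entropy(bad_logits, target))   # large loss (roughly 4.9)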
Conclusion and What's Next?
Congratulations! You have officially built and trained your first language model. You've demystified the process and taken a huge first step. You now know that at its core, training an AI is about:
- Defining a structure.
- Showing it data.
- Minimizing its error.
- Testing its knowledge.
Now, feel free to experiment! Try these challenges:
- Change the prompt in the script to "The second sentence is" and see if it generates the correct response about natural language processing (the sketch after this list shows a quick way to test several prompts at once).
- Add a third or fourth sentence to the sentences list and retrain the model.
- Increase num_train_epochs to 500 and see if the final loss gets even lower.
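For the first challenge you don't even need to retrain. Once the ./my_first_tiny_gemma_en folder exists, a small sketch like this (using the same paths as the main script) loads the saved model once and tries several prompts:
import torch
from transformers import pipeline

# Load the saved model and tokenizer once, then reuse the pipeline for every prompt.
generator = pipeline(
    "text-generation",
    model="./my_first_tiny_gemma_en",
    tokenizer="./my_first_tiny_gemma_en",
    device=0 if torch.cuda.is_available() else -1,  # Use GPU if available.
)

prompts = ["The first sentence is", "The second sentence is"]
for prompt in prompts:
    output = generator(prompt, max_new_tokens=15)
    print(f"Prompt: {prompt!r}")
    print(f"Generation: {output[0]['generated_text']!r}")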
Welcome to the world of building AI. Your journey from 0 to 0.1 is complete. The path to 1.0 is now up to you.