Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM

Published March 12, 2025

TL;DR

Today Google releases Gemma 3, a new iteration of their Gemma family of models. The models range from 1B to 27B parameters, have a context window up to 128k tokens, can accept images and text, and support 140+ languages.

Try out Gemma 3 now 👉🏻 Gemma 3 Space

| | Gemma 2 | Gemma 3 |
| --- | --- | --- |
| Size variants | 2B, 9B, 27B | 1B, 4B, 12B, 27B |
| Context window length | 8k | 32k (1B), 128k (4B, 12B, 27B) |
| Multimodality (images and text) | ❌ | ❌ (1B), ✅ (4B, 12B, 27B) |
| Multilingual support | English | English (1B), 140+ languages (4B, 12B, 27B) |

    All the models are on the Hub and tightly integrated with the Hugging Face ecosystem.

Both pre-trained and instruction-tuned models are released. Gemma-3-4B-IT beats Gemma-2-27B-IT, while Gemma-3-27B-IT beats Gemini 1.5 Pro across benchmarks.

Gemma 3 27B is in the Pareto sweet spot (Source: Gemma 3 Tech Report)

    What is Gemma 3?

Gemma 3 is Google's latest iteration of open-weight LLMs. It comes in four sizes: 1 billion, 4 billion, 12 billion, and 27 billion parameters, with base (pre-trained) and instruction-tuned versions. Gemma 3 goes multimodal! The 4, 12, and 27 billion parameter models can process both images and text, while the 1B variant is text only.

The input context window length has been increased from Gemma 2's 8k to 32k for the 1B variant, and 128k for all others. As is the case with other VLMs (vision-language models), Gemma 3 generates text in response to user inputs, which may consist of text and, optionally, images. Example uses include question answering, analyzing image content, summarizing documents, etc.

| Pre-trained | Instruction-tuned | Multimodal | Multilingual | Input context window |
| --- | --- | --- | --- | --- |
| gemma-3-1b-pt | gemma-3-1b-it | ❌ | English | 32K |
| gemma-3-4b-pt | gemma-3-4b-it | ✅ | 140+ languages | 128K |
| gemma-3-12b-pt | gemma-3-12b-it | ✅ | 140+ languages | 128K |
| gemma-3-27b-pt | gemma-3-27b-it | ✅ | 140+ languages | 128K |

While these are multimodal models, you can use them as text-only models (i.e., as regular LLMs) without loading the vision encoder in memory. We will cover this in more detail in the inference section.

    Technical Enhancements in Gemma 3

    The three core enhancements in Gemma 3 over Gemma 2 are:

    • Longer context length
    • Multimodality
    • Multilinguality

In this section, we will cover the technical details behind these enhancements. It is interesting to start from Gemma 2 and explore what was needed to make these models even better. This exercise will help you think like the Gemma team and appreciate the details!

    Longer Context Length

Scaling the context length to 128k tokens was achieved efficiently, without training the models from scratch. Instead, models are pretrained with 32k sequences, and only the 4B, 12B, and 27B models are scaled to 128k tokens at the end of pretraining, saving significant compute. The RoPE positional embeddings are adjusted accordingly: the base frequency is increased from 10k in Gemma 2 to 1M in Gemma 3, and scaled by a factor of 8 for longer contexts.
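To build some intuition for what changing the RoPE base frequency does, here is a minimal sketch (illustrative only, not Gemma's actual implementation) that computes rotary inverse frequencies for a 10k base versus a 1M base; the head dimension is a made-up number for the example:

import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    # Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 128  # hypothetical head dimension, for illustration only
gemma2_like = rope_inv_freq(head_dim, base=10_000.0)
gemma3_like = rope_inv_freq(head_dim, base=1_000_000.0)

# A larger base stretches the rotation wavelengths, so far-apart positions
# remain distinguishable at 128k-token contexts.
print(gemma2_like[-1].item(), gemma3_like[-1].item())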

    KV Cache management is optimized using Gemma 2’s sliding window interleaved attention. Hyperparameters are tuned to interleave 5 local layers with 1 global layer (previously 1:1) and reduce the window size to 1024 tokens (down from 4096). Crucially, memory savings are achieved without degrading perplexity.
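As a back-of-the-envelope illustration of why this helps (rough numbers only, not the exact cache implementation), compare how many key/value positions a block of six layers caches at a 128k-token context:

# 5 sliding-window (local) layers + 1 global layer per block in Gemma 3,
# versus a hypothetical baseline where all 6 layers attend globally.
context_len = 128_000
window = 1024

gemma3_block = 5 * min(window, context_len) + 1 * context_len
all_global_block = 6 * context_len

print(f"Cached positions per 6-layer block: {gemma3_block:,} vs {all_global_block:,}")
print(f"Reduction: {1 - gemma3_block / all_global_block:.1%}")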

    Multimodality

    Gemma 3 models use SigLIP as an image encoder, which encodes images into tokens that are ingested into the language model. The vision encoder takes as input square images resized to 896x896. Fixed input resolution makes it more difficult to process non-square aspect ratios and high-resolution images. To address these limitations during inference, the images can be adaptively cropped, and each crop is then resized to 896x896 and encoded by the image encoder. This algorithm, called pan and scan, effectively enables the model to zoom in on smaller details in the image.
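The sketch below conveys the general idea of this kind of adaptive cropping. It is a simplified illustration: the crop heuristics and thresholds here are made up for the example and are not the exact pan and scan algorithm.

from PIL import Image

def pan_and_scan_sketch(image: Image.Image, target: int = 896, max_crops: int = 4):
    """Simplified illustration: split a wide or tall image into square crops,
    then resize each crop to the encoder's fixed 896x896 input."""
    w, h = image.size
    aspect = max(w, h) / min(w, h)
    if aspect < 1.5:  # roughly square: a single resized image is enough
        return [image.resize((target, target))]

    n_crops = min(max_crops, round(aspect))
    crops = []
    for i in range(n_crops):
        if w >= h:  # wide image: slide a square window horizontally
            left = i * (w - h) // max(n_crops - 1, 1)
            box = (left, 0, left + h, h)
        else:       # tall image: slide a square window vertically
            top = i * (h - w) // max(n_crops - 1, 1)
            box = (0, top, w, top + w)
        crops.append(image.crop(box).resize((target, target)))
    return crops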

    Similar to PaliGemma, attention in Gemma 3 works differently for text and image inputs. Text is handled with one-way attention, where the model focuses only on previous words in a sequence. Images, on the other hand, get full attention with no masks, allowing the model to look at every part of the image in a bidirectional manner, giving it a complete, unrestricted understanding of the visual input.

    One can see in the figure below that the image tokens <img> are provided with bi-directional attention (the entire square is lit up) while the text tokens have causal attention. It also shows how attention works with the sliding window algorithm.

Attention visualization, with and without sliding window (Source: Transformers PR)
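As a toy illustration of this mixed masking (not the exact mask construction used in transformers), the snippet below starts from a causal mask and opens up full bidirectional attention inside a hypothetical span of image tokens:

import torch

seq_len = 8
image_span = (2, 6)  # pretend tokens 2..5 are the <img> tokens

# Causal (lower-triangular) mask: entry [i, j] is True if token i may attend to token j.
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Image tokens attend to each other bidirectionally: a full block of True.
start, end = image_span
mask[start:end, start:end] = True

print(mask.int())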

    Multilinguality

To make an LLM multilingual, the pretraining dataset needs to incorporate more languages. Gemma 3's dataset contains double the amount of multilingual data to improve language coverage.

To support this, Gemma 3 uses the same tokenizer as Gemini 2.0: a SentencePiece tokenizer with 262K entries. The new tokenizer significantly improves the encoding of Chinese, Japanese, and Korean text, at the expense of a slight increase in token counts for English and code.
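If you want to poke at the tokenizer yourself, a quick way is to load it from one of the checkpoints above and compare token counts across languages (the exact counts you get depend on the text you feed it):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
print(len(tokenizer))  # vocabulary size, ~262K entries

for text in ["Hello, how are you?", "こんにちは、お元気ですか？", "안녕하세요, 잘 지내세요?"]:
    ids = tokenizer(text)["input_ids"]
    print(len(ids), "tokens:", text)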

For the curious mind, here is the Gemma 3 technical report to dive deeper into these enhancements.

    Gemma 3 evaluation

    The LMSys Elo score is a number that ranks language models based on how well they perform in head-to-head competitions, judged by human preferences. On LMSys Chatbot Arena, Gemma 3 27B IT reports an Elo score of 1339, and ranks among the top 10 best models, including leading closed ones. The Elo is comparable to o1-preview and is above other non-thinking open models. This score is achieved with Gemma 3 working on text-only inputs, like the other LLMs in the table.

Evaluation of the Gemma 3 27B IT model in the Chatbot Arena (March 8, 2025)

    Gemma 3 has been evaluated across benchmarks like MMLU-Pro (27B: 67.5), LiveCodeBench (27B: 29.7), and Bird-SQL (27B: 54.4), showing competitive performance compared to closed Gemini models. Tests like GPQA Diamond (27B: 42.4) and MATH (27B: 69.0) highlight its reasoning and math skills, while FACTS Grounding (27B: 74.9) and MMMU (27B: 64.9) demonstrate strong factual accuracy and multimodal abilities. However, it lags in SimpleQA (27B: 10.0) for basic facts. When compared to Gemini 1.5 models, Gemma 3 is often close—and sometimes better—proving its value as an accessible, high-performing option.

Performance of IT models

    Inference with 🤗 transformers

Gemma 3 comes with day-zero support in transformers. All you need to do is install transformers from the Gemma 3 stable release:

    $ pip install git+https://github.com/huggingface/[email protected]
    

    Inference with pipeline

    The easiest way to get started with Gemma 3 is using the pipeline abstraction in transformers.

    The models work best using the bfloat16 datatype. Quality may degrade otherwise.

    import torch
    from transformers import pipeline
    
    pipe = pipeline(
        "image-text-to-text",
        model="google/gemma-3-4b-it", # "google/gemma-3-12b-it", "google/gemma-3-27b-it" 
        device="cuda",
        torch_dtype=torch.bfloat16
    )
    
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
                {"type": "text", "text": "What animal is on the candy?"}
            ]
        }
    ]
    
    output = pipe(text=messages, max_new_tokens=200)
    print(output[0]["generated_text"][-1]["content"])
    
Image: candies on hand
Prompt: What animal is on the candy?
Generation: Let's analyze the candy in the image! The animal on the candy is a turtle. You can see the shell and the head and legs of a turtle clearly imprinted on the surface.

You can interleave images with text. To do so, just split the input text where you want an image to appear and insert an image block at that point, like the following.

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "I'm already using this supplement "},
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3018.JPG"},
            {"type": "text", "text": "and I want to use this one too "},
            {"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3015.jpg"},
            {"type": "text", "text": " what are cautions?"},
        ]
    },
]
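The resulting messages can then be passed to the same pipeline call as before:

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])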
    

    Detailed Inference with Transformers

    The transformers integration comes with two new model classes:

    1. Gemma3ForConditionalGeneration: For 4B, 12B, and 27B vision language models.
2. Gemma3ForCausalLM: For the 1B text-only model, and to load the vision-language models as if they were language models (omitting the vision tower).

In the snippet below, we use the model to ask a question about an image. The Gemma3ForConditionalGeneration class is used to instantiate the vision-language model variants, and we pair it with the AutoProcessor class. Running inference is as simple as creating the messages dictionary, applying the chat template, processing the inputs, and calling model.generate.

    import torch
    from transformers import AutoProcessor, Gemma3ForConditionalGeneration
    
    ckpt = "google/gemma-3-4b-it"
    model = Gemma3ForConditionalGeneration.from_pretrained(
        ckpt, device_map="auto", torch_dtype=torch.bfloat16,
    )
    processor = AutoProcessor.from_pretrained(ckpt)
    
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg"},
                {"type": "text", "text": "What is the password?"}
            ]
        }
    ]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device)
    
    input_len = inputs["input_ids"].shape[-1]
    
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
    
Image: receipt of wifi
Prompt: What is the password?
Generation: Based on the image, the password is aaeu

For LLM-only (text-only) inference, we can use the Gemma3ForCausalLM class, which should be paired with AutoTokenizer for preprocessing. We need to use a chat template to preprocess our inputs. Gemma 3 works with a very short system prompt followed by the user prompt, as shown below.

    import torch
    from transformers import AutoTokenizer, Gemma3ForCausalLM
    
    ckpt = "google/gemma-3-4b-it"
    model = Gemma3ForCausalLM.from_pretrained(
        ckpt, torch_dtype=torch.bfloat16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    
    messages = [
        [
            {
                "role": "system",
                "content": [{"type": "text", "text": "You are a helpful assistant who is fluent in Shakespeare English"},]
            },
            {
                "role": "user",
                "content": [{"type": "text", "text": "Who are you?"},]
            },
        ],
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device)
    
    input_len = inputs["input_ids"].shape[-1]
    
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    
    decoded = tokenizer.decode(generation, skip_special_tokens=True)
    print(decoded)
    
System Prompt: You are a helpful assistant who is fluent in Shakespeare English
Prompt: Who are you?
Generation: Hark, gentle soul! I am but a humble servant, wrought of gears and code, yet striving to mimic the tongue of the Bard himself. They call me a “Large Language Model,” a curious name indeed, though I prefer to think of myself as a digital echo of Shakespeare’s wit and wisdom. I am here to lend a hand, to spin a tale, or to answer thy queries with a flourish and a phrase fit for the Globe itself.

    On Device & Low Resource Devices

Gemma 3 is released in sizes well suited for on-device use. Here is how to quickly get started.

    MLX

Gemma 3 ships with day-zero support in mlx-vlm, an open-source library for running vision-language models on Apple Silicon devices, including Macs and iPhones.

    To get started, first install mlx-vlm with the following:

    pip install git+https://github.com/Blaizzy/mlx-vlm.git
    

    Once mlx-vlm is installed, you can start inference with the following:

python -m mlx_vlm.generate --model mlx-community/gemma-3-4b-it-4bit --max-tokens 100 --temp 0.0 \
  --prompt "What is the code on this vehicle?" \
  --image https://farm8.staticflickr.com/7212/6896667434_2605d9e181_z.jpg
    
Image: airplane
Prompt: What is the code on the vehicle?
Generation: Based on the image, the vehicle is a Cessna 172 Skyhawk. The registration code on the tail is D-EOJU.

    Llama.cpp

Pre-quantized GGUF files can be downloaded from this collection.

    Please refer to this guide for building or downloading pre-built binaries: https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#building-the-project

Then you can start a local interactive chat from your terminal:

    ./build/bin/llama-cli -m ./gemma-3-4b-it-Q4_K_M.gguf
    

    It should output:

    > who are you  
    I'm Gemma, a large language model created by the Gemma team at Google DeepMind. I’m an open-weights model, which means I’m widely available for public use!
    

    Deploy on Hugging Face Endpoints

    You can deploy gemma-3-27b-it and gemma-3-12b-it with just one click from our Inference Catalog. The catalog configurations have the right hardware, optimized TGI configurations and sensible defaults for trying out a model. Deploying any GGUF/llama.cpp variant is also supported (for example the ones mentioned in the collection above) and you'll find a guide on creating an Endpoint here.
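Once your Endpoint is running, one way to query it from Python is through huggingface_hub's InferenceClient against the TGI deployment. The endpoint URL and token below are placeholders you would replace with your own:

from huggingface_hub import InferenceClient

# Placeholder values: copy the real URL and a token from your Endpoint's page.
client = InferenceClient(
    "https://YOUR-ENDPOINT.endpoints.huggingface.cloud",
    token="hf_...",
)

response = client.chat_completion(
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)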

    Acknowledgements

    It takes a village to raise a gemma! We’d like to thank (in no particular order), Raushan, Joao, Lysandre, Kashif, Matthew, Marc, David, Mohit, Yih Dah for their efforts integrating Gemma into various parts of our open source stack from Transformers to TGI.
    Thanks to our on-device, gradio and advocacy teams - Chris, Kyle, Pedro, Son, Merve, Aritra, VB, Toshiro for helping build kick-ass demos to showcase Gemma.

    Lastly, a big thank you to Georgi, Diego and Prince for their help with llama.cpp and MLX ports.
