This is a submodel derived from google/gemma-3n-E4B-it. It has been modified by slicing specific layers and resizing FFN dimensions; it is not the original model. To learn more about MatFormers, please review the launch blog and generate your own submodels with the MatFormer Lab.
Skipped layers: []
FFN hidden dimensions: [2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8]
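The per-layer FFN hidden sizes above follow a simple run-length pattern: 15 layers at 2_048 * 4, 10 at 2_048 * 8, 5 at 2_048 * 4, and 5 at 2_048 * 8. A minimal sketch, assuming you just want to expand that pattern into a flat list of integers:

# Expand the run-length pattern of the FFN hidden dimensions listed above.
BASE = 2_048
multipliers = [4] * 15 + [8] * 10 + [4] * 5 + [8] * 5  # one multiplier per layer (35 layers)
ffn_hidden_dims = [BASE * m for m in multipliers]

assert len(ffn_hidden_dims) == 35
print(ffn_hidden_dims[:3])  # [8192, 8192, 8192]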
This repository corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face transformers, supporting text, audio, and vision (image and video) inputs.
Gemma 3n models have multiple architecture innovations:
- They are available in two sizes based on effective parameters. While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
- They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an E2B), or you can access a spectrum of custom-sized models using the Mix-and-Match method.
Learn more about these techniques in the technical blog post and the Gemma documentation.
Gemma 3n model card
Model Page: Gemma 3n
Resources and Technical Documentation:
Terms of Use: Terms
Authors: Google DeepMind
Model Information
Summary description and brief definition of inputs and outputs.
Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.
Inputs and outputs
- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
  - Audio data encoded to 6.25 tokens per second from a single channel
  - Total input context of 32K tokens (see the token-budget sketch after this list)
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output length up to 32K tokens, subtracting the request input tokens
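The image and audio token costs above imply a simple input budget. A minimal back-of-the-envelope sketch, assuming "32K" means 32_768 tokens (an assumption) and ignoring special and formatting tokens:

# Rough input token budget based on the figures above (illustrative only).
IMAGE_TOKENS = 256             # per image, at any of the supported resolutions
AUDIO_TOKENS_PER_SECOND = 6.25
CONTEXT_WINDOW = 32_768        # "32K" taken as 32_768 here; an assumption

def remaining_text_budget(num_images: int, audio_seconds: float) -> int:
    """Estimate how many text tokens remain after image and audio inputs."""
    used = num_images * IMAGE_TOKENS + int(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return CONTEXT_WINDOW - used

# Example: two images plus a 30-second audio clip uses 699 tokens,
# leaving roughly 32_069 tokens for text.
print(remaining_text_budget(num_images=2, audio_seconds=30.0))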
Usage
Below are some code snippets to help you get started quickly with running the model. First, install the Transformers library. Gemma 3n is supported starting from transformers 4.53.0.
$ pip install -U transformers
Then, copy the snippet from the section that is relevant for your use case.
Running with the pipeline API
You can initialize the model and processor for inference with pipeline as follows.
from transformers import pipeline
import torch

# Load the multimodal (image + text) pipeline on GPU in bfloat16.
pipe = pipeline(
    "image-text-to-text",
    model="pranjal-pravesh/gemma-3n-E3B",
    device="cuda",
    torch_dtype=torch.bfloat16,
)
With instruction-tuned models, you need to use chat templates to process your inputs first. Then, you can pass them to the pipeline.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
# Okay, let's take a look!
# Based on the image, the animal on the candy is a **turtle**.
# You can see the shell shape and the head and legs.
Running the model on a single GPU
from transformers import AutoProcessor, Gemma3nForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "pranjal-pravesh/gemma-3n-E3B"

# Load the model in bfloat16 and put it in eval mode.
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16,
).eval()

processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
# It has a slightly soft, natural feel, likely captured in daylight.
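The same model and processor also handle text-only prompts. A minimal sketch that reuses the objects loaded above, assuming no image or audio input (the prompt text is illustrative):

text_messages = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize the MatFormer idea in two sentences."}]}
]

text_inputs = processor.apply_chat_template(
    text_messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**text_inputs, max_new_tokens=100, do_sample=False)

# Decode only the tokens generated after the prompt.
print(processor.decode(out[0][text_inputs["input_ids"].shape[-1]:], skip_special_tokens=True))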