---
license: gemma
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- text-embeddings-inference
extra_gated_heading: Access EmbeddingGemma on Hugging Face
extra_gated_prompt: >-
  To access EmbeddingGemma on Hugging Face, you’re required to review and agree
  to Google’s usage license. To do this, please ensure you’re logged in to
  Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---
# litert-community/embeddinggemma-300m

Main Model Card: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)

## Overview
This model card provides a few variants of the EmbeddingGemma model that are ready for deployment on Android and iOS using LiteRT, or on Android via the Google AI Edge RAG Library.
## Use the models

### LiteRT
- Try out the demo example on GitHub.
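For orientation, the sketch below shows one way to run the downloaded `.tflite` file with the standard TFLite/LiteRT `Interpreter` API in Kotlin. The input/output shapes (256 int32 token ids in, a 768-dimensional embedding out) are assumptions for illustration only; the demo app on GitHub is the reference for the exact model signature and tokenization.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// Minimal sketch (not taken from the demo app): load the .tflite flatbuffer and
// run a single embedding pass. Shapes below are assumptions; inspect the model
// signature of the variant you downloaded before relying on them.
fun embed(modelFile: File, tokenIds: IntArray): FloatArray {
    val interpreter = Interpreter(modelFile)
    val maxSeqLen = 256                                  // assumed max sequence length
    val input = Array(1) { IntArray(maxSeqLen) }         // assumed int32 token-id input, zero-padded
    tokenIds.copyInto(input[0], 0, 0, minOf(tokenIds.size, maxSeqLen))
    val output = Array(1) { FloatArray(768) }            // assumed embedding width
    interpreter.run(input, output)
    interpreter.close()
    return output[0]
}
```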
### RAG
- Try out the EmbeddingGemma model in the Google AI Edge RAG Library. You can find the SDK on GitHub or follow our Android guide to install directly from Maven. We have also published a sample app.
- Use the sentencepiece model as the tokenizer for the EmbeddingGemma model.
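As a rough sketch of how the two files fit together, the snippet below constructs an embedding model for the RAG pipeline from this repo's `.tflite` model and sentencepiece tokenizer. The `GemmaEmbeddingModel` class name, its package, and the constructor arguments are assumptions modeled on the SDK's existing embedder classes; the Android guide and the GitHub sample are authoritative for the actual API.

```kotlin
// Assumed package/class name -- verify against the RAG SDK's Android guide.
import com.google.ai.edge.localagents.rag.models.GemmaEmbeddingModel

// Wire the EmbeddingGemma .tflite model and its sentencepiece tokenizer into the
// RAG library's embedder. Paths are placeholders for wherever your app stages
// the downloaded files on device.
val embedder = GemmaEmbeddingModel(
    "/data/local/tmp/embeddinggemma-300m.tflite",  // model from this repo
    "/data/local/tmp/sentencepiece.model",         // tokenizer from this repo
    /* useGpu = */ true,
)
// The embedder is then handed to the RAG pipeline's retriever / vector store setup.
```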
## Performance

### Android
Note that all benchmark stats are from a Samsung S25 Ultra.
| Backend | Quantization | Max sequence length | Init time (ms) | Inference time (ms) | Memory (RSS in MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| GPU | Mixed Precision* | 256 | 1175 | 64 | 762 | 179 |
| GPU | Mixed Precision* | 512 | 1445 | 119 | 762 | 179 |
| GPU | Mixed Precision* | 1024 | 1545 | 241 | 771 | 183 |
| GPU | Mixed Precision* | 2048 | 1707 | 683 | 786 | 196 |
| CPU | Mixed Precision* | 256 | 17.6 | 66 | 110 | 179 |
| CPU | Mixed Precision* | 512 | 24.9 | 169 | 123 | 179 |
| CPU | Mixed Precision* | 1024 | 35.4 | 549 | 169 | 183 |
| CPU | Mixed Precision* | 2048 | 35.8 | 2455 | 333 | 196 |
*Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).
Notes:
- Init time: a one-time cost paid at application initialization; subsequent inferences do not pay this cost.
- Memory: indicator of peak RAM usage.
- Model size: size of the .tflite flatbuffer (the serialization format for LiteRT models).
- CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads (see the sketch after these notes).
- Benchmarks are run with the cache enabled and initialized, so latency on a first, cold run may differ.
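To mirror the CPU setup described above, the thread count and the XNNPACK delegate can be set through the interpreter options. This is an illustrative configuration under those assumptions, not the benchmark harness itself; the model path is a placeholder.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// Mirror the benchmark's CPU configuration: XNNPACK delegate with 4 threads.
val options = Interpreter.Options()
    .setNumThreads(4)       // 4 CPU threads, as in the table above
    .setUseXNNPACK(true)    // XNNPACK is typically on by default; stated here explicitly
val interpreter = Interpreter(File("embeddinggemma-300m.tflite"), options)
```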