litert-community/embeddinggemma-300m

Main Model Card: google/embeddinggemma-300m

Overview

This model card provides a few variants of the EmbeddingGemma model that are ready for deployment on Android and iOS using LiteRT, or on Android via the Google AI Edge RAG Library.

Use the models

LiteRT

  • Try out the demo example on GitHub.

RAG

  • Try out the EmbeddingGemma model in the Google AI Edge RAG Library. You can find the SDK on GitHub or follow our Android guide to install directly from Maven. We have also published a sample app.
  • Use the sentencepiece model as the tokenizer for the EmbeddingGemma model.
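In a RAG pipeline, the embeddings produced by EmbeddingGemma are typically compared with cosine similarity to rank document chunks against a query. Below is a minimal sketch of that ranking step in plain NumPy; the vectors are random placeholders standing in for real model outputs, and the 768-dimensional size is an assumption based on EmbeddingGemma's published output dimension.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> list:
    """Indices of the k document embeddings most similar to the query."""
    scores = [cosine_similarity(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)[:k]

# Placeholder vectors; in practice these would come from running the
# EmbeddingGemma model on the query and on each document chunk.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
docs = rng.normal(size=(10, 768))
print(top_k(query, docs, k=3))
```

The retrieved indices would then map back to the document chunks fed into the generation step of the RAG pipeline.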

Performance

Android

Note that all benchmark stats are from a Samsung S25 Ultra.

| Backend | Quantization | Max sequence length | Init time (ms) | Inference time (ms) | Memory (RSS in MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| GPU | Mixed Precision* | 256 | 1175 | 64 | 762 | 179 |
| GPU | Mixed Precision* | 512 | 1445 | 119 | 762 | 179 |
| GPU | Mixed Precision* | 1024 | 1545 | 241 | 771 | 183 |
| GPU | Mixed Precision* | 2048 | 1707 | 683 | 786 | 196 |
| CPU | Mixed Precision* | 256 | 17.6 | 66 | 110 | 179 |
| CPU | Mixed Precision* | 512 | 24.9 | 169 | 123 | 179 |
| CPU | Mixed Precision* | 1024 | 35.4 | 549 | 169 | 183 |
| CPU | Mixed Precision* | 2048 | 35.8 | 2455 | 333 | 196 |

*Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).

Notes:

  • Init time: a one-time cost paid at application initialization; subsequent inferences do not pay this cost.
  • Memory: an indicator of peak RAM usage.
  • Model size: measured as the size of the .tflite flatbuffer (the serialization format for LiteRT models).
  • CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads.
  • Benchmarks are run with the cache enabled and initialized; first-run latency may differ.
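As a rough sanity check on the model-size column: assuming roughly 300M parameters (an assumption taken from the model name, not stated in the table), an all-int4 encoding would weigh in near 150 MB and an all-int8 encoding near 300 MB. The observed 179 MB at a 256-token sequence length sits between those bounds, consistent with the mostly-int4 mixed-precision scheme in the footnote. A back-of-envelope sketch:

```python
# Rough size bounds for a ~300M-parameter model under the quantization
# schemes in the footnote (int4 = 0.5 bytes/param, int8 = 1 byte/param).
PARAMS = 300e6  # assumed from the model name "embeddinggemma-300m"
MB = 1e6

all_int4_mb = PARAMS * 0.5 / MB  # lower bound: everything in int4
all_int8_mb = PARAMS * 1.0 / MB  # upper bound: everything in int8
observed_mb = 179                # from the 256-token rows of the table

print(all_int4_mb, observed_mb, all_int8_mb)  # 150.0 179 300.0
```

The observed size lands closer to the int4 bound, which matches the footnote's description of int4 embeddings, feedforward, and projection layers with int8 attention.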