|
--- |
|
license: gemma |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- text-embeddings-inference |
|
extra_gated_heading: Access EmbeddingGemma on Hugging Face |
|
extra_gated_prompt: To access EmbeddingGemma on Hugging Face, you’re required to review and |
|
agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging |
|
Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
--- |
|
|
|
# litert-community/embeddinggemma-300m |
|
|
|
Main Model Card: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m) |
|
|
|
## Overview |
|
|
|
This repository provides several variants of the EmbeddingGemma model that are ready for deployment on Android and iOS using [LiteRT](https://ai.google.dev/edge/litert), or on Android via the [Google AI Edge RAG Library](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).
|
|
|
## Use the models |
|
|
|
### LiteRT |
|
|
|
* Try out the demo [example](https://github.com/google-ai-edge/LiteRT/tree/main/litert/samples/semantic_similarity) on GitHub. |
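
For orientation, below is a minimal Kotlin sketch of driving one of the `.tflite` variants directly through the LiteRT (TensorFlow Lite) `Interpreter` API and scoring sentence similarity with cosine similarity. The tensor signature (token ids zero-padded to the max sequence length, a single pooled `[1, 768]` output) and the file name are assumptions; the demo app linked above is the authoritative reference.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File
import kotlin.math.sqrt

// Assumptions: the model takes token ids padded to the sequence length and
// returns one pooled embedding. Verify shapes against the actual .tflite file.
const val MAX_SEQ_LEN = 256   // the 256-token variant from the table below
const val EMBEDDING_DIM = 768 // EmbeddingGemma's full output dimension

fun embed(interpreter: Interpreter, tokenIds: IntArray): FloatArray {
    val input = Array(1) { IntArray(MAX_SEQ_LEN) } // zero-padded token ids, shape [1, 256]
    tokenIds.copyInto(input[0], endIndex = minOf(tokenIds.size, MAX_SEQ_LEN))
    val output = Array(1) { FloatArray(EMBEDDING_DIM) } // shape [1, 768]
    interpreter.run(input, output)
    return output[0]
}

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

fun main() {
    // Hypothetical file name; use the variant you downloaded from this repo.
    val interpreter = Interpreter(File("embeddinggemma-300m.tflite"))
    // Token ids come from the SentencePiece tokenizer shipped alongside the model.
    val a = embed(interpreter, intArrayOf(/* token ids for sentence A */))
    val b = embed(interpreter, intArrayOf(/* token ids for sentence B */))
    println("cosine similarity = ${cosineSimilarity(a, b)}")
    interpreter.close()
}
```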
|
|
|
### RAG |
|
|
|
* Try out the EmbeddingGemma model in the [Google AI Edge RAG Library](https://ai.google.dev/edge/mediapipe/solutions/genai/rag). You can find the SDK on [GitHub](https://github.com/google-ai-edge/ai-edge-apis/tree/main/local_agents/rag) or follow our [Android guide](https://ai.google.dev/edge/mediapipe/solutions/genai/rag/android) to install directly from Maven. We have also published a [sample app](https://github.com/google-ai-edge/ai-edge-apis/tree/main/examples/rag).
|
* Use the SentencePiece model as the tokenizer for the EmbeddingGemma model; a usage sketch follows this list.
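
The sketch below shows how wiring EmbeddingGemma into the RAG library might look on Android. The class name `GemmaEmbeddingModel` and its constructor arguments are assumptions modeled on the library's embedding-model pattern (a model path, a SentencePiece tokenizer path, and a GPU flag); consult the Android guide and sample app above for the exact, current API.

```kotlin
import com.google.ai.edge.localagents.rag.models.GemmaEmbeddingModel

// Assumed constructor: (embedding model path, SentencePiece tokenizer path, useGpu).
// Paths are hypothetical; point them at the files downloaded from this repo.
val embedder = GemmaEmbeddingModel(
    "/data/local/tmp/embeddinggemma-300m.tflite",
    "/data/local/tmp/sentencepiece.model",
    /* useGpu= */ true,
)
```

The embedder can then be handed to the library's semantic memory (for example, a vector store sized to the 768-dimensional output), as demonstrated in the sample app.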
|
|
|
## Performance |
|
|
|
### Android |
|
|
|
All benchmark numbers below were collected on a Samsung S25 Ultra.
|
|
|
<table border="1"> |
|
<tr> |
|
<th>Backend</th> |
|
<th>Quantization</th> |
|
<th>Max sequence length</th> |
|
<th>Init time (ms)</th> |
|
<th>Inference time (ms)</th> |
|
<th>Memory (RSS in MB)</th> |
|
<th>Model size (MB)</th> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">256</p></td> |
|
<td><p style="text-align: right">1175</p></td> |
|
<td><p style="text-align: right">64</p></td> |
|
<td><p style="text-align: right">762</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">512</p></td> |
|
<td><p style="text-align: right">1445</p></td> |
|
<td><p style="text-align: right">119</p></td> |
|
<td><p style="text-align: right">762</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">1024</p></td> |
|
<td><p style="text-align: right">1545</p></td> |
|
<td><p style="text-align: right">241</p></td> |
|
<td><p style="text-align: right">771</p></td> |
|
<td><p style="text-align: right">183</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">2048</p></td> |
|
<td><p style="text-align: right">1707</p></td> |
|
<td><p style="text-align: right">683</p></td> |
|
<td><p style="text-align: right">786</p></td> |
|
<td><p style="text-align: right">196</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">256</p></td> |
|
<td><p style="text-align: right">17.6</p></td> |
|
<td><p style="text-align: right">66</p></td> |
|
<td><p style="text-align: right">110</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">512</p></td> |
|
<td><p style="text-align: right">24.9</p></td> |
|
<td><p style="text-align: right">169</p></td> |
|
<td><p style="text-align: right">123</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">1024</p></td> |
|
<td><p style="text-align: right">35.4</p></td> |
|
<td><p style="text-align: right">549</p></td> |
|
<td><p style="text-align: right">169</p></td> |
|
<td><p style="text-align: right">183</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">2048</p></td> |
|
<td><p style="text-align: right">35.8</p></td> |
|
<td><p style="text-align: right">2455</p></td> |
|
<td><p style="text-align: right">333</p></td> |
|
<td><p style="text-align: right">196</p></td> |
|
</tr> |
|
</table> |
|
|
|
\*Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).
|
|
|
Notes: |
|
|
|
* Init time: a one-time cost paid at application initialization; subsequent inferences do not pay this cost.

* Memory: an indicator of peak RAM usage.

* Model size: measured as the size of the `.tflite` FlatBuffer (the serialization format for LiteRT models).

* Inference on CPU is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads, as shown in the sketch after this list.

* Benchmarks are run with the cache enabled and initialized; latency may differ during the first run.
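
For reference, the benchmark's CPU configuration corresponds to interpreter options along these lines (a sketch; XNNPACK is typically enabled by default in recent LiteRT builds, and the file name is hypothetical):

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// CPU benchmark setup: XNNPACK delegate with 4 threads.
val options = Interpreter.Options()
    .setNumThreads(4)     // the 4-thread setting used in the table above
    .setUseXNNPACK(true)  // usually on by default; shown for explicitness
val interpreter = Interpreter(File("embeddinggemma-300m.tflite"), options)
```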
|
|
|
|