|
--- |
|
license: gemma |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- text-embeddings-inference |
|
extra_gated_heading: Access EmbeddingGemma on Hugging Face |
|
extra_gated_prompt: To access EmbeddingGemma on Hugging Face, you’re required to review and |
|
agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging |
|
Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
--- |
|
|
|
# litert-community/embeddinggemma-300m |
|
|
|
Main Model Card: [google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m) |
|
|
|
## Overview |
|
|
|
This repository provides several variants of the EmbeddingGemma model that are ready for deployment on Android and iOS using [LiteRT](https://ai.google.dev/edge/litert), or on Android via the [Google AI Edge RAG Library](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).
|
|
|
## Use the models |
|
|
|
### LiteRT |
|
|
|
* Try out the demo [example](https://github.com/google-ai-edge/LiteRT/tree/main/litert/samples/semantic_similarity) on GitHub. |
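
For orientation, below is a minimal Kotlin sketch of driving one of the `.tflite` variants directly through the LiteRT (TensorFlow Lite) `Interpreter` API and scoring sentence similarity with cosine similarity. The tensor signature (token ids zero-padded to the max sequence length, a single pooled `[1, 768]` output) and the file name are assumptions; the demo app linked above is the authoritative reference.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File
import kotlin.math.sqrt

// Assumptions: the model takes token ids padded to the sequence length and
// returns one pooled embedding. Verify shapes against the actual .tflite file.
const val MAX_SEQ_LEN = 256   // the 256-token variant from the table below
const val EMBEDDING_DIM = 768 // EmbeddingGemma's full output dimension

fun embed(interpreter: Interpreter, tokenIds: IntArray): FloatArray {
    val input = Array(1) { IntArray(MAX_SEQ_LEN) } // zero-padded token ids, shape [1, 256]
    tokenIds.copyInto(input[0], endIndex = minOf(tokenIds.size, MAX_SEQ_LEN))
    val output = Array(1) { FloatArray(EMBEDDING_DIM) } // shape [1, 768]
    interpreter.run(input, output)
    return output[0]
}

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

fun main() {
    // Hypothetical file name; use the variant you downloaded from this repo.
    val interpreter = Interpreter(File("embeddinggemma-300m.tflite"))
    // Token ids come from the SentencePiece tokenizer shipped alongside the model.
    val a = embed(interpreter, intArrayOf(/* token ids for sentence A */))
    val b = embed(interpreter, intArrayOf(/* token ids for sentence B */))
    println("cosine similarity = ${cosineSimilarity(a, b)}")
    interpreter.close()
}
```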
|
|
|
### RAG |
|
|
|
* Try out the EmbeddingGemma model in the [Google AI Edge RAG Library](https://ai.google.dev/edge/mediapipe/solutions/genai/rag). You can find the SDK on [GitHub](https://github.com/google-ai-edge/ai-edge-apis/tree/main/local_agents/rag) or follow our [Android guide](https://ai.google.dev/edge/mediapipe/solutions/genai/rag/android) to install directly from Maven. We have also published a [sample app](https://github.com/google-ai-edge/ai-edge-apis/tree/main/examples/rag).
|
* Use the SentencePiece model as the tokenizer for the EmbeddingGemma model; a usage sketch follows this list.
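
The sketch below shows how wiring EmbeddingGemma into the RAG library might look on Android. The class name `GemmaEmbeddingModel` and its constructor arguments are assumptions modeled on the library's embedding-model pattern (a model path, a SentencePiece tokenizer path, and a GPU flag); consult the Android guide and sample app above for the exact, current API.

```kotlin
import com.google.ai.edge.localagents.rag.models.GemmaEmbeddingModel

// Assumed constructor: (embedding model path, SentencePiece tokenizer path, useGpu).
// Paths are hypothetical; point them at the files downloaded from this repo.
val embedder = GemmaEmbeddingModel(
    "/data/local/tmp/embeddinggemma-300m.tflite",
    "/data/local/tmp/sentencepiece.model",
    /* useGpu= */ true,
)
```

The embedder can then be handed to the library's semantic memory (for example, a vector store sized to the 768-dimensional output), as demonstrated in the sample app.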
|
|
|
## Performance |
|
|
|
### Android |
|
|
|
All benchmark numbers below were collected on a Samsung S25 Ultra.
|
|
|
<table border="1"> |
|
<tr> |
|
<th>Backend</th> |
|
<th>Quantization</th> |
|
<th>Max sequence length</th> |
|
<th>Init time (ms)</th> |
|
<th>Inference time (ms)</th> |
|
<th>Memory (RSS in MB)</th> |
|
<th>Model size (MB)</th> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">256</p></td> |
|
<td><p style="text-align: right">1175</p></td> |
|
<td><p style="text-align: right">64</p></td> |
|
<td><p style="text-align: right">762</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">512</p></td> |
|
<td><p style="text-align: right">1445</p></td> |
|
<td><p style="text-align: right">119</p></td> |
|
<td><p style="text-align: right">762</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">1024</p></td> |
|
<td><p style="text-align: right">1545</p></td> |
|
<td><p style="text-align: right">241</p></td> |
|
<td><p style="text-align: right">771</p></td> |
|
<td><p style="text-align: right">183</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">GPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">2048</p></td> |
|
<td><p style="text-align: right">1707</p></td> |
|
<td><p style="text-align: right">683</p></td> |
|
<td><p style="text-align: right">786</p></td> |
|
<td><p style="text-align: right">196</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">256</p></td> |
|
<td><p style="text-align: right">17.6</p></td> |
|
<td><p style="text-align: right">66</p></td> |
|
<td><p style="text-align: right">110</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">512</p></td> |
|
<td><p style="text-align: right">24.9</p></td> |
|
<td><p style="text-align: right">169</p></td> |
|
<td><p style="text-align: right">123</p></td> |
|
<td><p style="text-align: right">179</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">1024</p></td> |
|
<td><p style="text-align: right">35.4</p></td> |
|
<td><p style="text-align: right">549</p></td> |
|
<td><p style="text-align: right">169</p></td> |
|
<td><p style="text-align: right">183</p></td> |
|
</tr> |
|
<tr> |
|
<td><p style="text-align: right">CPU</p></td> |
|
<td><p style="text-align: right">Mixed Precision*</p></td> |
|
<td><p style="text-align: right">2048</p></td> |
|
<td><p style="text-align: right">35.8</p></td> |
|
<td><p style="text-align: right">2455</p></td> |
|
<td><p style="text-align: right">333</p></td> |
|
<td><p style="text-align: right">196</p></td> |
|
</tr> |
|
</table> |
|
|
|
\*Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).
|
|
|
Notes: |
|
|
|
* Init time: a one-time cost paid at application initialization; subsequent inferences do not pay this cost.

* Memory: an indicator of peak RAM usage.

* Model size: measured as the size of the `.tflite` FlatBuffer (the serialization format for LiteRT models).

* Inference on CPU is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads, as shown in the sketch after this list.

* Benchmarks are run with the cache enabled and initialized; latency may differ during the first run.
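
For reference, the benchmark's CPU configuration corresponds to interpreter options along these lines (a sketch; XNNPACK is typically enabled by default in recent LiteRT builds, and the file name is hypothetical):

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// CPU benchmark setup: XNNPACK delegate with 4 threads.
val options = Interpreter.Options()
    .setNumThreads(4)     // the 4-thread setting used in the table above
    .setUseXNNPACK(true)  // usually on by default; shown for explicitness
val interpreter = Interpreter(File("embeddinggemma-300m.tflite"), options)
```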
|
|
|
|