embeddinggemma-300m-ONNX-uint8

Update Sep. 20, 2025: I removed the last_hidden_state output from the model and left only the sentence_embedding one.

This is based on https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/blob/main/onnx/model_quantized.onnx, but it outputs a uint8 tensor instead of an f32 one.

This model is compatible with Qdrant, but I'm not sure which other vector DBs it works with.
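
Since Qdrant supports uint8 vectors natively, the raw output can be stored without dequantizing. Here's a minimal sketch of setting up a collection with qdrant_client; the collection name, URL, distance choice, and the embedding variable are placeholders, not something from this repo:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# store the model's uint8 output directly, no dequantization needed
client.create_collection(
    collection_name="embeddinggemma_uint8",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        datatype=models.Datatype.UINT8,
    ),
)

# embedding is assumed to be one uint8 vector produced by the model
client.upsert(
    collection_name="embeddinggemma_uint8",
    points=[models.PointStruct(id=1, vector=embedding.tolist(), payload={"text": "example"})],
)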

For calibration data I used my own multilingual dataset of around 1.5m tokens: https://github.com/electroglyph/dataset_build

I ran all 1.5m tokens through the model and logged the highest/lowest embedding values seen: -0.19112960994243622 to 0.22116543352603912
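
Something like this is roughly what the calibration pass looks like; the paths and the calibration_texts iterable are stand-ins, not the exact script:

import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("onnx-community/embeddinggemma-300m-ONNX")
session = rt.InferenceSession("onnx/model_quantized.onnx", providers=["CPUExecutionProvider"])

lo, hi = float("inf"), float("-inf")
for text in calibration_texts:  # stand-in for the dataset_build corpus
    inputs = tokenizer(text, truncation=True, return_tensors="np")
    emb = session.run(["sentence_embedding"], dict(inputs))[0]
    lo = min(lo, float(emb.min()))
    hi = max(hi, float(emb.max()))

print(lo, hi)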

So I hacked on the sentence_embedding output of the ONNX model and added a QuantizeLinear node based on the range of -0.22116543352603912 to 0.22116543352603912 to keep it symmetric, as sketched below. It would be cool if Qdrant let me specify my own zero point for a little more accuracy, but symmetric will have to do.
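
The graph surgery is straightforward with the onnx package. This is a sketch of the idea rather than the exact script; the file paths and initializer names are placeholders:

import onnx
from onnx import TensorProto, helper

model = onnx.load("onnx/model_quantized.onnx")
graph = model.graph

# symmetric uint8: scale from the calibrated range, zero point fixed at 128
scale = 0.22116543352603912 / 127.0
graph.initializer.extend([
    helper.make_tensor("emb_scale", TensorProto.FLOAT, [], [scale]),
    helper.make_tensor("emb_zero_point", TensorProto.UINT8, [], [128]),
])

# rename the f32 tensor so QuantizeLinear can sit between its producer and the graph output
producer = next(n for n in graph.node if "sentence_embedding" in n.output)
idx = list(producer.output).index("sentence_embedding")
producer.output[idx] = "sentence_embedding_f32"

graph.node.append(helper.make_node(
    "QuantizeLinear",
    inputs=["sentence_embedding_f32", "emb_scale", "emb_zero_point"],
    outputs=["sentence_embedding"],
))

# the output keeps its name but is now uint8
out = next(o for o in graph.output if o.name == "sentence_embedding")
out.type.tensor_type.elem_type = TensorProto.UINT8

onnx.checker.check_model(model)
onnx.save(model, "onnx/model.onnx")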

Note: this model is no longer compatible with SentenceTransformer! Or at least I wasn't able to figure it out right away; it mangles the uint8 output.

Benchmarks

For benchmarking with MTEB I dequantize the uint8 output to the f32 that MTEB expects.

These retrieval benchmarks are a little wild. All the benchmarks used the "task: search result | query: " prompt format for queries. I have no idea why this model benchmarks better than the base model on most retrieval tasks, but I'll take it.

mteb retrieval results

mteb totals

Example Benchmark Code

import mteb
from mteb.encoder_interface import PromptType
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer

class CustomModel:
    def __init__(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("C:/LLM/embeddinggemma-300m-ONNX-uint8")
        self.session = rt.InferenceSession("C:/LLM/embeddinggemma-300m-ONNX-uint8/onnx/model.onnx", providers=["CPUExecutionProvider"])
        self.scale = 0.22116543352603912 / 127.0  # matches the QuantizeLinear scale baked into the model

    def dequantize(self, quantized: list | np.ndarray, scale: float) -> np.ndarray:
        # session.run returns a list of outputs, so this may arrive as [array of shape (batch, dim)]
        quantized = np.array(quantized)
        # invert the QuantizeLinear node: zero point 128, same scale
        dequant = (quantized.astype(np.float32) - 128) * scale
        # drop the leading axis added by wrapping the run() output in np.array
        if dequant.ndim == 3 and dequant.shape[0] == 1:
            return np.squeeze(dequant, axis=0)
        return dequant

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        if prompt_type == PromptType.query:
            # embeddinggemma's retrieval query prompt
            sentences = [f"task: search result | query: {s}" for s in sentences]
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="np")
        # the model's only output is the uint8 sentence_embedding tensor
        q = self.session.run(["sentence_embedding"], dict(inputs))
        return self.dequantize(q, self.scale)


model = CustomModel()
benchmark = mteb.get_benchmark("NanoBEIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, corpus_chunk_size=128)
for r in results:
    print(r)

Example FastEmbed Usage

from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

TextEmbedding.add_custom_model(
    model="embeddinggemma-300m-ONNX-uint8",
    pooling=PoolingType.DISABLED,  # pooling already happens inside the ONNX graph
    normalization=False,  # sentence_embedding is already the final (quantized) embedding
    sources=ModelSource(hf="electroglyph/embeddinggemma-300m-ONNX-uint8"),
    dim=768,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="embeddinggemma-300m-ONNX-uint8")
embeddings = list(model.embed("test"))
print(embeddings)