---
license: gemma
base_model:
- google/embeddinggemma-300m
pipeline_tag: sentence-similarity
library_name: transformers.js
tags:
- text-embeddings-inference
---
# embeddinggemma-300m-ONNX-uint8
This is based on https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/blob/main/onnx/model_quantized.onnx, but it outputs a uint8 tensor instead of an f32 one.
This model is compatible with Qdrant (see the sketch below); I'm not sure which other vector DBs can ingest uint8 embeddings directly.
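Since the output is already uint8, it can be stored in Qdrant without any post-processing. Below is a minimal sketch, assuming a local Qdrant instance, a `qdrant-client` version with `Datatype.UINT8` support, and a local copy of this repo at the placeholder path:

```python
# Hedged sketch: embed a query with this repo's ONNX model and store the raw
# uint8 vector in Qdrant. MODEL_DIR is a placeholder path, not a real location.
import onnxruntime as rt
from transformers import AutoTokenizer
from qdrant_client import QdrantClient, models

MODEL_DIR = "path/to/embeddinggemma-300m-ONNX-uint8"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
session = rt.InferenceSession(f"{MODEL_DIR}/onnx/model.onnx", providers=["CPUExecutionProvider"])

inputs = tokenizer(
    ["task: search result | query: what is a uint8 embedding?"],
    padding=True, truncation=True, return_tensors="np",
)
(embedding,) = session.run(["sentence_embedding"], dict(inputs))  # uint8, shape (1, 768)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="demo",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        datatype=models.Datatype.UINT8,  # store the uint8 vector as-is
    ),
)
client.upsert(
    collection_name="demo",
    points=[models.PointStruct(id=1, vector=embedding[0].tolist(), payload={"text": "example"})],
)
```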
For calibration data, I used my own multilingual dataset of around 1.5m tokens: https://github.com/electroglyph/dataset_build

I ran all 1.5m tokens through the model and logged the highest and lowest output values seen, which gave a range of -0.19112960994243622 to 0.22116543352603912 (a rough sketch of this pass is shown below).
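This is not the exact calibration script, just a sketch of the idea; `load_calibration_batches` is a hypothetical helper that yields lists of strings from the dataset above, and the file paths are placeholders:

```python
# Hedged sketch of the min/max calibration pass described above.
# load_calibration_batches() is a hypothetical loader over the calibration
# dataset; the model/tokenizer paths are placeholders.
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("onnx-community/embeddinggemma-300m-ONNX")
session = rt.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

lo, hi = np.inf, -np.inf
for batch in load_calibration_batches():
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="np")
    (emb,) = session.run(["sentence_embedding"], dict(inputs))
    lo = min(lo, float(emb.min()))
    hi = max(hi, float(emb.max()))

print(lo, hi)  # observed range was roughly -0.1911 to 0.2212
```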
So I hacked on the `sentence_embedding` output of the ONNX model and added a QuantizeLinear node based on a range of -0.22116543352603912 to 0.22116543352603912 to keep it symmetric. It would be cool if Qdrant let me specify my own zero point for a little more accuracy, but symmetric will have to do.
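For reference, here is a minimal sketch of that kind of graph edit using the `onnx` Python API. The file names, and the assumption that the graph output is literally named `sentence_embedding`, are illustrative; this is not the exact script used to build this repo:

```python
# Hedged sketch: graft a QuantizeLinear node onto the sentence_embedding output
# so the graph emits uint8 instead of f32.
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

SCALE = 0.22116543352603912 / 127.0  # symmetric calibration range mapped onto uint8
ZERO_POINT = 128                     # uint8 midpoint, so 0.0 maps to 128

model = onnx.load("model_quantized.onnx")
graph = model.graph

# Redirect the node that currently produces "sentence_embedding" to an
# intermediate f32 name, then quantize that tensor back into the original name.
for node in graph.node:
    for i, name in enumerate(node.output):
        if name == "sentence_embedding":
            node.output[i] = "sentence_embedding_f32"

graph.initializer.extend([
    numpy_helper.from_array(np.array(SCALE, dtype=np.float32), "quant_scale"),
    numpy_helper.from_array(np.array(ZERO_POINT, dtype=np.uint8), "quant_zero_point"),
])
graph.node.append(helper.make_node(
    "QuantizeLinear",
    inputs=["sentence_embedding_f32", "quant_scale", "quant_zero_point"],
    outputs=["sentence_embedding"],
))

# The graph output keeps its name and shape, but its element type becomes uint8.
for output in graph.output:
    if output.name == "sentence_embedding":
        output.type.tensor_type.elem_type = TensorProto.UINT8

onnx.save(model, "model_uint8.onnx")
```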
## Benchmarks
For benchmarking with MTEB I dequantize the uint8 output to the f32 that MTEB expects.
These retrieval benchmark results are a little wild. All the benchmarks used the `task: search result | query: ` query format. I have no idea why this model benchmarks better than the base model on most retrieval tasks, but I'll take it.
## Benchmark example code
```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer


class CustomModel:
    def __init__(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("C:/LLM/embeddinggemma-300m-ONNX-uint8")
        self.session = rt.InferenceSession(
            "C:/LLM/embeddinggemma-300m-ONNX-uint8/onnx/model.onnx",
            providers=["CPUExecutionProvider"],
        )
        # same scale used by the QuantizeLinear node: symmetric max / 127
        self.scale = 0.22116543352603912 / 127.0

    def dequantize(self, quantized: list | np.ndarray, scale: float) -> np.ndarray:
        # map uint8 back to f32: (q - zero_point) * scale, with zero_point = 128
        quantized = np.array(quantized)
        dequant = (quantized.astype(np.float32) - 128) * scale
        # session.run returns a list, so drop the leading singleton dimension
        if dequant.ndim == 3 and dequant.shape[0] == 1:
            return np.squeeze(dequant, axis=0)
        return dequant

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        if prompt_type == PromptType.query:
            sentences = [f"task: search result | query: {s}" for s in sentences]
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="np")
        q = self.session.run(["sentence_embedding"], dict(inputs))
        return self.dequantize(q, self.scale)


model = CustomModel()
benchmark = mteb.get_benchmark("NanoBEIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, corpus_chunk_size=128)
for r in results:
    print(r)
```
## FastEmbed usage
You should be able to use this as a FastEmbed custom model with no pooling and no normalization; the `sentence_embedding` output is ready to use as-is.
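Something along these lines should work. The custom-model API differs between FastEmbed versions, and the repo id below is a placeholder for wherever you host or mirror this model, so treat this as a sketch rather than a verified recipe:

```python
# Hedged sketch: register this model as a FastEmbed custom model with pooling
# and normalization disabled. The HF repo id is a placeholder.
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

TextEmbedding.add_custom_model(
    model="embeddinggemma-300m-ONNX-uint8",
    pooling=PoolingType.DISABLED,   # sentence_embedding is already pooled
    normalization=False,            # and already quantized; leave it untouched
    sources=ModelSource(hf="your-namespace/embeddinggemma-300m-ONNX-uint8"),  # placeholder
    dim=768,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="embeddinggemma-300m-ONNX-uint8")
embeddings = list(model.embed(["task: search result | query: hello"]))
```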