---
license: gemma
base_model:
  - google/embeddinggemma-300m
pipeline_tag: sentence-similarity
library_name: transformers.js
tags:
  - text-embeddings-inference
---

# embeddinggemma-300m-ONNX-uint8

This is based on https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/blob/main/onnx/model_quantized.onnx, but it outputs a uint8 tensor instead of an f32 one.
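As a quick sanity check, you can inspect the graph's output metadata with onnxruntime to confirm that the sentence_embedding output is uint8. A minimal sketch (the local path is just an example; point it at wherever you downloaded the model):

```python
import onnxruntime as rt

# Hypothetical local path; adjust to wherever you cloned this repo.
session = rt.InferenceSession(
    "embeddinggemma-300m-ONNX-uint8/onnx/model.onnx",
    providers=["CPUExecutionProvider"],
)

for out in session.get_outputs():
    # Expect something like: sentence_embedding tensor(uint8) [batch_size, 768]
    print(out.name, out.type, out.shape)
```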

This model is compatible with Qdrant; I'm not sure which other vector DBs support uint8 embeddings.
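For reference, here is a minimal sketch of a Qdrant collection that stores the quantized vectors directly. It assumes a recent Qdrant/qdrant-client version that supports the uint8 vector datatype; the collection name is made up, and the 768-dim size matches embeddinggemma-300m's default output (verify against your own model's output shape):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Hypothetical collection name; embeddinggemma-300m outputs 768-dim embeddings.
client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        datatype=models.Datatype.UINT8,  # store the uint8 output as-is
    ),
)
```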

For calibration data I used my own multilingual dataset of around 1.5m tokens: https://github.com/electroglyph/dataset_build

I ran all 1.5m tokens through the model and logged the highest/lowest values seen, which gave a range of -0.19112960994243622 to 0.22116543352603912.
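For illustration, the calibration loop looks roughly like this. This is a sketch, not the exact script I used: `calibration_texts` is a hypothetical iterable over the dataset above, and the model/tokenizer paths should point at the f32-output model being calibrated.

```python
import onnxruntime as rt
from transformers import AutoTokenizer

# Paths are illustrative; use the f32-output model you want to calibrate.
tokenizer = AutoTokenizer.from_pretrained("onnx-community/embeddinggemma-300m-ONNX")
session = rt.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

lo, hi = float("inf"), float("-inf")
for text in calibration_texts:  # hypothetical: yields each calibration document
    inputs = tokenizer(text, truncation=True, return_tensors="np")
    emb = session.run(["sentence_embedding"], dict(inputs))[0]
    # Track the global min/max over every embedding value seen.
    lo, hi = min(lo, float(emb.min())), max(hi, float(emb.max()))

print(lo, hi)
```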

So I hacked on the sentence_embedding output of the ONNX model and added a QuantizeLinear node based on the symmetric range -0.22116543352603912 to 0.22116543352603912. It would be cool if Qdrant let me specify my own zero point for a little more accuracy, but symmetric quantization will have to do.
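In other words, the f32 embeddings are mapped to uint8 with a scale derived from that symmetric range and a zero point of 128. A small sketch of the round trip (this is just the math, not part of the model graph):

```python
import numpy as np

SCALE = 0.22116543352603912 / 127.0  # symmetric range mapped onto uint8
ZERO_POINT = 128

def quantize(x: np.ndarray) -> np.ndarray:
    # Equivalent to what the QuantizeLinear node does inside the graph.
    return np.clip(np.round(x / SCALE) + ZERO_POINT, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray) -> np.ndarray:
    # Approximate recovery of the original f32 embedding.
    return (q.astype(np.float32) - ZERO_POINT) * SCALE
```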

## Benchmarks

For benchmarking with MTEB, I dequantize the uint8 output back to the f32 values MTEB expects.

These retrieval benchmarks are a little wild. All benchmarks used the `task: search result | query: ...` prompt format for queries. I have no idea why this model benchmarks better than the base model on most retrieval tasks, but I'll take it.

*(image: mteb retrieval results)*

*(image: mteb totals)*

## Benchmark example code

```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer

class CustomModel:
    def __init__(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("C:/LLM/embeddinggemma-300m-ONNX-uint8")
        self.session = rt.InferenceSession("C:/LLM/embeddinggemma-300m-ONNX-uint8/onnx/model.onnx", providers=["CPUExecutionProvider"])
        # Scale used by the QuantizeLinear node (symmetric range mapped to uint8).
        self.scale = 0.22116543352603912 / 127.0

    def dequantize(self, quantized: list | np.ndarray, scale: float) -> np.ndarray:
        # Convert the uint8 output back to f32: subtract the zero point (128), then rescale.
        quantized = np.array(quantized)
        dequant = (quantized.astype(np.float32) - 128) * scale
        if dequant.ndim == 3 and dequant.shape[0] == 1:
            # session.run returns a list of outputs, so drop the leading axis.
            return np.squeeze(dequant, axis=0)
        return dequant

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        if prompt_type == PromptType.query:
            # Queries use the "task: search result | query:" prompt format.
            sentences = [f"task: search result | query: {s}" for s in sentences]
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="np")
        q = self.session.run(["sentence_embedding"], dict(inputs))
        return self.dequantize(q, self.scale)


model = CustomModel()
benchmark = mteb.get_benchmark("NanoBEIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, corpus_chunk_size=128)
for r in results:
    print(r)
```

## FastEmbed usage

You should be able to use this as a custom model with no pooling and no normalization. The sentence_embedding output is ready to use.
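A rough sketch of registering it as a FastEmbed custom model is below. This is not a verified config: the import paths and enum names may differ between fastembed versions, the repo id and 768-dim size are assumptions, and you should check the FastEmbed custom-model docs for the exact signature.

```python
from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType

# Pooling and normalization are disabled because the sentence_embedding
# output is already pooled; adjust names/paths to your fastembed version.
TextEmbedding.add_custom_model(
    model="electroglyph/embeddinggemma-300m-ONNX-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/embeddinggemma-300m-ONNX-uint8"),
    dim=768,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="electroglyph/embeddinggemma-300m-ONNX-uint8")
embeddings = list(model.embed(["task: search result | query: example query"]))
```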