metadata

license: apache-2.0
base_model:
  - Qwen/Qwen3-0.6B-Base
tags:
  - transformers
  - sentence-transformers
  - sentence-similarity
  - feature-extraction

Qwen3-Embedding-0.6B-onnx-int4

This is an ONNX version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

This model has been dynamically quantized to int4/uint8 and further modified to output a 1024-dimensional uint8 tensor.

You probably don't want to use this model on CPU. I tested it on a Ryzen CPU with VNNI: it runs at about the same speed as the base f32 model, but with roughly 2% lower retrieval accuracy. I'm posting it here in case it's useful for GPU users. I'm not sure it actually is, but I already made it, so here it is.

This model is compatible with Qdrant's fastembed. Please note these details:

Run the model without pooling and without normalization
Pay attention to the example query format in the code below
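
As a minimal sketch of that query format, the helper below wraps a raw query in the same "Instruct: ... / Query: ..." template used in the benchmark code in this card. The function name, parameter names, and the sample query are hypothetical and only for illustration. The benchmark code rewrites queries only and leaves corpus documents unchanged.

```python
# Hypothetical helper; names are illustrative, not part of fastembed or this model.
DEFAULT_TASK = "Given a web search query, retrieve relevant passages that answer the query"


def format_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Wrap a raw query in the 'Instruct: ... / Query: ...' template."""
    return f"Instruct: {task}\nQuery: {query}"


formatted = format_query("What causes aurora borealis?")
print(formatted)
```

A different task description can be passed via `task` if your retrieval setting is not web search.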

Quantization method

I ran an int4 quantization pass with a block size of 128 (a block size of 32 was extremely close in accuracy), excluding the same nodes as in my uint8 model.
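
To illustrate what the block size controls, here is a rough pure-Python sketch of symmetric block-wise 4-bit quantization, where each block of weights shares one scale. This is an assumption-laden toy, not the exact scheme used to build this model; the function names and the toy weights are made up. The toy data includes one deliberate outlier to exaggerate the effect, whereas for this model the two block sizes were very close in accuracy.

```python
# Illustrative sketch only: NOT the exact quantization scheme used for this model.

def quantize_blockwise_int4(weights, block_size=128):
    """Quantize a flat list of floats; each block gets its own scale.

    Returns (codes, scales) where codes are ints in [-8, 7].
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=128):
    """Reconstruct approximate floats from codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def mean_abs_error(weights, block_size):
    """Average reconstruction error after a quantize/dequantize round trip."""
    codes, scales = quantize_blockwise_int4(weights, block_size)
    restored = dequantize_blockwise(codes, scales, block_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# Toy weights: small values plus one outlier near the start.
weights = [0.01 * ((i % 17) - 8) for i in range(256)]
weights[5] = 2.0

err_128 = mean_abs_error(weights, 128)
err_32 = mean_abs_error(weights, 32)
print(f"mean abs error, block size 128: {err_128:.5f}")
print(f"mean abs error, block size 32:  {err_32:.5f}")
```

In this toy, the outlier inflates the scale of whichever block contains it, so a smaller block limits how many neighbouring weights lose precision. Smaller blocks also store more scales, which is the usual trade-off.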

Then I quantized the remaining non-excluded nodes to uint8 the same way as here: https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8

Here are the nodes I excluded:

["/0/auto_model/ConstantOfShape",
"/0/auto_model/Constant_28",
"/0/auto_model/layers.25/post_attention_layernorm/Pow",
"/0/auto_model/layers.26/input_layernorm/Pow",
"/0/auto_model/layers.25/input_layernorm/Pow",
"/0/auto_model/layers.24/post_attention_layernorm/Pow",
"/0/auto_model/layers.24/input_layernorm/Pow",
"/0/auto_model/layers.23/post_attention_layernorm/Pow",
"/0/auto_model/layers.23/input_layernorm/Pow",
"/0/auto_model/layers.22/post_attention_layernorm/Pow",
"/0/auto_model/layers.22/input_layernorm/Pow",
"/0/auto_model/layers.3/input_layernorm/Pow",
"/0/auto_model/layers.4/input_layernorm/Pow",
"/0/auto_model/layers.3/post_attention_layernorm/Pow",
"/0/auto_model/layers.21/post_attention_layernorm/Pow",
"/0/auto_model/layers.5/input_layernorm/Pow",
"/0/auto_model/layers.4/post_attention_layernorm/Pow",
"/0/auto_model/layers.5/post_attention_layernorm/Pow",
"/0/auto_model/layers.6/input_layernorm/Pow",
"/0/auto_model/layers.6/post_attention_layernorm/Pow",
"/0/auto_model/layers.7/input_layernorm/Pow",
"/0/auto_model/layers.8/input_layernorm/Pow",
"/0/auto_model/layers.7/post_attention_layernorm/Pow",
"/0/auto_model/layers.26/post_attention_layernorm/Pow",
"/0/auto_model/layers.9/input_layernorm/Pow",
"/0/auto_model/layers.8/post_attention_layernorm/Pow",
"/0/auto_model/layers.21/input_layernorm/Pow",
"/0/auto_model/layers.20/post_attention_layernorm/Pow",
"/0/auto_model/layers.9/post_attention_layernorm/Pow",
"/0/auto_model/layers.10/input_layernorm/Pow",
"/0/auto_model/layers.20/input_layernorm/Pow",
"/0/auto_model/layers.11/input_layernorm/Pow",
"/0/auto_model/layers.10/post_attention_layernorm/Pow",
"/0/auto_model/layers.12/input_layernorm/Pow",
"/0/auto_model/layers.11/post_attention_layernorm/Pow",
"/0/auto_model/layers.12/post_attention_layernorm/Pow",
"/0/auto_model/layers.13/input_layernorm/Pow",
"/0/auto_model/layers.19/post_attention_layernorm/Pow",
"/0/auto_model/layers.13/post_attention_layernorm/Pow",
"/0/auto_model/layers.14/input_layernorm/Pow",
"/0/auto_model/layers.19/input_layernorm/Pow",
"/0/auto_model/layers.18/post_attention_layernorm/Pow",
"/0/auto_model/layers.14/post_attention_layernorm/Pow",
"/0/auto_model/layers.15/input_layernorm/Pow",
"/0/auto_model/layers.16/input_layernorm/Pow",
"/0/auto_model/layers.15/post_attention_layernorm/Pow",
"/0/auto_model/layers.18/input_layernorm/Pow",
"/0/auto_model/layers.17/post_attention_layernorm/Pow",
"/0/auto_model/layers.17/input_layernorm/Pow",
"/0/auto_model/layers.16/post_attention_layernorm/Pow",
"/0/auto_model/layers.27/post_attention_layernorm/Pow",
"/0/auto_model/layers.27/input_layernorm/Pow",
"/0/auto_model/norm/Pow",
"/0/auto_model/layers.25/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.25/post_attention_layernorm/Add",
"/0/auto_model/layers.26/input_layernorm/Add",
"/0/auto_model/layers.26/input_layernorm/ReduceMean",
"/0/auto_model/layers.25/input_layernorm/ReduceMean",
"/0/auto_model/layers.25/input_layernorm/Add",
"/0/auto_model/layers.24/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.24/post_attention_layernorm/Add",
"/0/auto_model/layers.24/input_layernorm/Add",
"/0/auto_model/layers.24/input_layernorm/ReduceMean",
"/0/auto_model/layers.23/post_attention_layernorm/Add",
"/0/auto_model/layers.23/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.23/input_layernorm/ReduceMean",
"/0/auto_model/layers.23/input_layernorm/Add",
"/0/auto_model/layers.22/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.22/post_attention_layernorm/Add",
"/0/auto_model/layers.26/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.26/post_attention_layernorm/Add",
"/0/auto_model/layers.22/input_layernorm/ReduceMean",
"/0/auto_model/layers.22/input_layernorm/Add",
"/0/auto_model/layers.3/input_layernorm/Add",
"/0/auto_model/layers.3/input_layernorm/ReduceMean",
"/0/auto_model/layers.21/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.21/post_attention_layernorm/Add",
"/0/auto_model/layers.4/input_layernorm/Add",
"/0/auto_model/layers.4/input_layernorm/ReduceMean",
"/0/auto_model/layers.3/post_attention_layernorm/Add",
"/0/auto_model/layers.3/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.5/input_layernorm/Add",
"/0/auto_model/layers.5/input_layernorm/ReduceMean",
"/0/auto_model/layers.4/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.4/post_attention_layernorm/Add",
"/0/auto_model/layers.5/post_attention_layernorm/Add",
"/0/auto_model/layers.5/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.6/input_layernorm/Add",
"/0/auto_model/layers.6/input_layernorm/ReduceMean",
"/0/auto_model/layers.6/post_attention_layernorm/Add",
"/0/auto_model/layers.6/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.7/input_layernorm/Add",
"/0/auto_model/layers.7/input_layernorm/ReduceMean",
"/0/auto_model/layers.8/input_layernorm/ReduceMean",
"/0/auto_model/layers.8/input_layernorm/Add",
"/0/auto_model/layers.7/post_attention_layernorm/Add",
"/0/auto_model/layers.7/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.9/input_layernorm/Add",
"/0/auto_model/layers.9/input_layernorm/ReduceMean",
"/0/auto_model/layers.8/post_attention_layernorm/Add",
"/0/auto_model/layers.8/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.21/input_layernorm/Add",
"/0/auto_model/layers.21/input_layernorm/ReduceMean",
"/0/auto_model/layers.20/post_attention_layernorm/Add",
"/0/auto_model/layers.20/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.9/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.9/post_attention_layernorm/Add",
"/0/auto_model/layers.10/input_layernorm/ReduceMean",
"/0/auto_model/layers.10/input_layernorm/Add",
"/0/auto_model/layers.20/input_layernorm/Add",
"/0/auto_model/layers.20/input_layernorm/ReduceMean",
"/0/auto_model/layers.11/input_layernorm/ReduceMean",
"/0/auto_model/layers.11/input_layernorm/Add",
"/0/auto_model/layers.10/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.10/post_attention_layernorm/Add",
"/0/auto_model/layers.12/input_layernorm/ReduceMean",
"/0/auto_model/layers.12/input_layernorm/Add",
"/0/auto_model/layers.11/post_attention_layernorm/Add",
"/0/auto_model/layers.11/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.12/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.12/post_attention_layernorm/Add",
"/0/auto_model/layers.13/input_layernorm/Add",
"/0/auto_model/layers.13/input_layernorm/ReduceMean",
"/0/auto_model/layers.19/post_attention_layernorm/Add",
"/0/auto_model/layers.19/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.13/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.13/post_attention_layernorm/Add",
"/0/auto_model/layers.14/input_layernorm/Add",
"/0/auto_model/layers.14/input_layernorm/ReduceMean",
"/0/auto_model/layers.19/input_layernorm/ReduceMean",
"/0/auto_model/layers.19/input_layernorm/Add",
"/0/auto_model/layers.18/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.18/post_attention_layernorm/Add",
"/0/auto_model/layers.14/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.14/post_attention_layernorm/Add",
"/0/auto_model/layers.15/input_layernorm/ReduceMean",
"/0/auto_model/layers.15/input_layernorm/Add",
"/0/auto_model/layers.16/input_layernorm/Add",
"/0/auto_model/layers.16/input_layernorm/ReduceMean",
"/0/auto_model/layers.15/post_attention_layernorm/Add",
"/0/auto_model/layers.15/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.18/input_layernorm/Add",
"/0/auto_model/layers.18/input_layernorm/ReduceMean",
"/0/auto_model/layers.17/post_attention_layernorm/Add",
"/0/auto_model/layers.17/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.17/input_layernorm/ReduceMean",
"/0/auto_model/layers.17/input_layernorm/Add",
"/0/auto_model/layers.16/post_attention_layernorm/Add",
"/0/auto_model/layers.16/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.27/post_attention_layernorm/Add",
"/0/auto_model/layers.27/post_attention_layernorm/ReduceMean",
"/0/auto_model/layers.27/input_layernorm/Add",
"/0/auto_model/layers.27/input_layernorm/ReduceMean",
"/0/auto_model/layers.27/self_attn/q_norm/Pow",
"/0/auto_model/layers.14/self_attn/k_norm/Pow",
"/0/auto_model/layers.26/self_attn/q_norm/Pow",
"/0/auto_model/layers.25/self_attn/q_norm/Pow",
"/0/auto_model/layers.26/self_attn/k_norm/Pow",
"/0/auto_model/layers.8/self_attn/k_norm/Pow",
"/0/auto_model/layers.24/self_attn/k_norm/Pow",
"/0/auto_model/layers.24/self_attn/q_norm/Pow",
"/0/auto_model/layers.25/self_attn/k_norm/Pow",
"/0/auto_model/layers.23/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/self_attn/k_norm/Pow",
"/0/auto_model/layers.12/self_attn/k_norm/Pow",
"/0/auto_model/layers.13/self_attn/k_norm/Pow",
"/0/auto_model/layers.2/mlp/down_proj/MatMul",
"/0/auto_model/layers.3/post_attention_layernorm/Cast",
"/0/auto_model/layers.3/Add",
"/0/auto_model/layers.3/Add_1",
"/0/auto_model/layers.4/input_layernorm/Cast",
"/0/auto_model/layers.3/input_layernorm/Cast",
"/0/auto_model/layers.2/Add_1",
"/0/auto_model/layers.4/Add",
"/0/auto_model/layers.4/post_attention_layernorm/Cast",
"/0/auto_model/layers.5/input_layernorm/Cast",
"/0/auto_model/layers.4/Add_1",
"/0/auto_model/layers.5/post_attention_layernorm/Cast",
"/0/auto_model/layers.5/Add",
"/0/auto_model/layers.5/Add_1",
"/0/auto_model/layers.6/input_layernorm/Cast",
"/0/auto_model/layers.7/Add_1",
"/0/auto_model/layers.8/input_layernorm/Cast",
"/0/auto_model/layers.7/Add",
"/0/auto_model/layers.7/post_attention_layernorm/Cast",
"/0/auto_model/layers.6/Add",
"/0/auto_model/layers.6/post_attention_layernorm/Cast",
"/0/auto_model/layers.6/Add_1",
"/0/auto_model/layers.7/input_layernorm/Cast",
"/0/auto_model/layers.8/Add",
"/0/auto_model/layers.8/post_attention_layernorm/Cast",
"/0/auto_model/layers.9/input_layernorm/Cast",
"/0/auto_model/layers.8/Add_1",
"/0/auto_model/layers.9/post_attention_layernorm/Cast",
"/0/auto_model/layers.9/Add",
"/0/auto_model/layers.9/Add_1",
"/0/auto_model/layers.10/input_layernorm/Cast",
"/0/auto_model/layers.11/input_layernorm/Cast",
"/0/auto_model/layers.10/Add_1",
"/0/auto_model/layers.10/Add",
"/0/auto_model/layers.10/post_attention_layernorm/Cast",
"/0/auto_model/layers.11/Add",
"/0/auto_model/layers.11/post_attention_layernorm/Cast",
"/0/auto_model/layers.11/Add_1",
"/0/auto_model/layers.12/input_layernorm/Cast",
"/0/auto_model/layers.12/Add",
"/0/auto_model/layers.12/post_attention_layernorm/Cast",
"/0/auto_model/layers.12/Add_1",
"/0/auto_model/layers.13/input_layernorm/Cast",
"/0/auto_model/layers.13/Add",
"/0/auto_model/layers.13/post_attention_layernorm/Cast",
"/0/auto_model/layers.14/input_layernorm/Cast",
"/0/auto_model/layers.13/Add_1",
"/0/auto_model/layers.14/Add_1",
"/0/auto_model/layers.15/input_layernorm/Cast",
"/0/auto_model/layers.14/post_attention_layernorm/Cast",
"/0/auto_model/layers.14/Add",
"/0/auto_model/layers.15/post_attention_layernorm/Cast",
"/0/auto_model/layers.15/Add_1",
"/0/auto_model/layers.16/input_layernorm/Cast",
"/0/auto_model/layers.15/Add",
"/0/auto_model/layers.17/input_layernorm/Cast",
"/0/auto_model/layers.16/Add_1",
"/0/auto_model/layers.16/Add",
"/0/auto_model/layers.16/post_attention_layernorm/Cast",
"/0/auto_model/layers.19/input_layernorm/Cast",
"/0/auto_model/layers.18/Add_1",
"/0/auto_model/layers.18/input_layernorm/Cast",
"/0/auto_model/layers.17/Add_1",
"/0/auto_model/layers.17/Add",
"/0/auto_model/layers.17/post_attention_layernorm/Cast",
"/0/auto_model/layers.18/post_attention_layernorm/Cast",
"/0/auto_model/layers.18/Add",
"/0/auto_model/layers.19/Add",
"/0/auto_model/layers.19/post_attention_layernorm/Cast",
"/0/auto_model/layers.22/Add_1",
"/0/auto_model/layers.23/input_layernorm/Cast",
"/0/auto_model/layers.20/Add_1",
"/0/auto_model/layers.21/input_layernorm/Cast",
"/0/auto_model/layers.21/Add_1",
"/0/auto_model/layers.22/input_layernorm/Cast",
"/0/auto_model/layers.19/Add_1",
"/0/auto_model/layers.20/input_layernorm/Cast",
"/0/auto_model/layers.24/input_layernorm/Cast",
"/0/auto_model/layers.23/Add_1",
"/0/auto_model/layers.22/Add",
"/0/auto_model/layers.22/post_attention_layernorm/Cast",
"/0/auto_model/layers.21/Add",
"/0/auto_model/layers.21/post_attention_layernorm/Cast",
"/0/auto_model/layers.20/Add",
"/0/auto_model/layers.20/post_attention_layernorm/Cast",
"/0/auto_model/layers.23/post_attention_layernorm/Cast",
"/0/auto_model/layers.23/Add",
"/0/auto_model/layers.25/input_layernorm/Cast",
"/0/auto_model/layers.24/Add_1",
"/0/auto_model/layers.24/post_attention_layernorm/Cast",
"/0/auto_model/layers.24/Add",
"/0/auto_model/layers.25/Add",
"/0/auto_model/layers.25/post_attention_layernorm/Cast",
"/0/auto_model/layers.25/Add_1",
"/0/auto_model/layers.26/input_layernorm/Cast",
"/0/auto_model/layers.26/Add",
"/0/auto_model/layers.26/post_attention_layernorm/Cast",
"/0/auto_model/layers.21/self_attn/q_norm/Pow",
"/0/auto_model/layers.26/Add_1",
"/0/auto_model/layers.27/input_layernorm/Cast",
"/0/auto_model/layers.27/Add",
"/0/auto_model/layers.27/post_attention_layernorm/Cast",
"/0/auto_model/norm/Add",
"/0/auto_model/norm/ReduceMean",
"/0/auto_model/layers.23/self_attn/k_norm/Pow",
"/0/auto_model/layers.21/self_attn/k_norm/Pow",
"/0/auto_model/layers.22/self_attn/k_norm/Pow",
"/0/auto_model/layers.10/self_attn/k_norm/Pow",
"/0/auto_model/layers.19/self_attn/q_norm/Pow",
"/0/auto_model/layers.2/mlp/Mul",
"/0/auto_model/layers.22/self_attn/q_norm/Pow",
"/0/auto_model/layers.11/self_attn/k_norm/Pow",
"/0/auto_model/layers.20/self_attn/q_norm/Pow",
"/0/auto_model/layers.20/self_attn/k_norm/Pow",
"/0/auto_model/layers.18/self_attn/q_norm/Pow",
"/0/auto_model/layers.17/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/mlp/down_proj/MatMul",
"/0/auto_model/layers.19/self_attn/k_norm/Pow",
"/0/auto_model/layers.27/Add_1",
"/0/auto_model/norm/Cast",
"/0/auto_model/layers.16/self_attn/k_norm/Pow",
"/0/auto_model/layers.18/self_attn/k_norm/Pow",
"/0/auto_model/layers.11/self_attn/q_norm/Pow",
"/0/auto_model/layers.9/self_attn/q_norm/Pow",
"/0/auto_model/layers.26/self_attn/q_norm/Add",
"/0/auto_model/layers.26/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.14/self_attn/k_norm/Add",
"/0/auto_model/layers.14/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.16/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/mlp/Mul",
"/0/auto_model/layers.27/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.27/self_attn/q_norm/Add",
"/0/auto_model/layers.9/self_attn/k_norm/Pow",
"/0/auto_model/layers.17/self_attn/k_norm/Pow",
"/0/auto_model/layers.26/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.26/self_attn/k_norm/Add",
"/0/auto_model/layers.25/self_attn/k_norm/Add",
"/0/auto_model/layers.25/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.13/self_attn/k_norm/Add",
"/0/auto_model/layers.13/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.10/self_attn/q_norm/Pow",
"/0/auto_model/layers.25/input_layernorm/Mul_1",
"/0/auto_model/layers.27/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.27/self_attn/k_norm/Add",
"/0/auto_model/layers.26/input_layernorm/Mul_1",
"/0/auto_model/layers.15/self_attn/q_norm/Pow",
"/0/auto_model/layers.12/self_attn/k_norm/Add",
"/0/auto_model/layers.12/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.25/self_attn/q_norm/Add",
"/0/auto_model/layers.25/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.24/input_layernorm/Mul_1",
"/0/auto_model/layers.12/self_attn/q_norm/Pow",
"/0/auto_model/layers.24/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.24/self_attn/q_norm/Add",
"/0/auto_model/layers.24/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.24/self_attn/k_norm/Add",
"/0/auto_model/layers.22/mlp/Mul",
"/0/auto_model/layers.2/post_attention_layernorm/Pow",
"/0/auto_model/layers.23/mlp/Mul",
"/0/auto_model/layers.24/mlp/Mul",
"/0/auto_model/layers.23/input_layernorm/Mul_1",
"/0/auto_model/layers.14/self_attn/q_norm/Pow",
"/0/auto_model/layers.14/self_attn/k_proj/MatMul",
"/0/auto_model/layers.14/self_attn/k_norm/Cast",
"/0/auto_model/layers.14/self_attn/Reshape_1",
"/0/auto_model/layers.21/mlp/Mul",
"/0/auto_model/layers.3/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.3/input_layernorm/Sqrt",
"/0/auto_model/layers.4/input_layernorm/Sqrt",
"/0/auto_model/layers.5/input_layernorm/Sqrt",
"/0/auto_model/layers.4/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.5/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.6/input_layernorm/Sqrt",
"/0/auto_model/layers.6/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.8/input_layernorm/Sqrt",
"/0/auto_model/layers.8/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.7/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.7/input_layernorm/Sqrt",
"/0/auto_model/layers.9/input_layernorm/Sqrt",
"/0/auto_model/layers.10/input_layernorm/Sqrt",
"/0/auto_model/layers.9/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.11/input_layernorm/Sqrt",
"/0/auto_model/layers.10/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.12/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.11/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.12/input_layernorm/Sqrt",
"/0/auto_model/layers.13/input_layernorm/Sqrt",
"/0/auto_model/layers.14/input_layernorm/Sqrt",
"/0/auto_model/layers.13/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.15/input_layernorm/Sqrt",
"/0/auto_model/layers.14/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.16/input_layernorm/Sqrt",
"/0/auto_model/layers.15/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.17/input_layernorm/Sqrt",
"/0/auto_model/layers.16/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.19/input_layernorm/Sqrt",
"/0/auto_model/layers.17/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.18/input_layernorm/Sqrt",
"/0/auto_model/layers.18/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.19/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.23/input_layernorm/Sqrt",
"/0/auto_model/layers.20/input_layernorm/Sqrt",
"/0/auto_model/layers.21/input_layernorm/Sqrt",
"/0/auto_model/layers.22/input_layernorm/Sqrt",
"/0/auto_model/layers.22/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.24/input_layernorm/Sqrt",
"/0/auto_model/layers.20/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.21/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.23/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.25/input_layernorm/Sqrt",
"/0/auto_model/layers.24/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.25/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.26/input_layernorm/Sqrt",
"/0/auto_model/layers.26/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.15/self_attn/k_norm/Pow",
"/0/auto_model/layers.27/input_layernorm/Sqrt",
"/0/auto_model/layers.27/post_attention_layernorm/Sqrt",
"/0/auto_model/layers.2/input_layernorm/Pow",
"/0/auto_model/layers.26/mlp/Mul",
"/0/auto_model/layers.23/self_attn/q_norm/Add",
"/0/auto_model/layers.23/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.13/self_attn/q_norm/Pow",
"/0/auto_model/layers.21/self_attn/q_norm/Add",
"/0/auto_model/layers.21/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.6/self_attn/q_norm/Pow",
"/0/auto_model/layers.27/self_attn/Reshape_7",
"/0/auto_model/layers.27/self_attn/MatMul_1",
"/0/auto_model/layers.27/self_attn/Transpose_4",
"/0/auto_model/layers.26/self_attn/Expand_1",
"/0/auto_model/layers.26/self_attn/Unsqueeze_19",
"/0/auto_model/layers.26/self_attn/v_proj/MatMul",
"/0/auto_model/layers.26/self_attn/Transpose_2",
"/0/auto_model/layers.26/self_attn/Reshape_6",
"/0/auto_model/layers.26/self_attn/Reshape_2",
"/0/auto_model/layers.11/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.11/self_attn/k_norm/Add",
"/0/auto_model/layers.22/input_layernorm/Mul_1",
"/0/auto_model/layers.25/mlp/Mul",
"/0/auto_model/layers.8/self_attn/k_norm/Cast",
"/0/auto_model/layers.8/self_attn/k_proj/MatMul",
"/0/auto_model/layers.8/self_attn/Reshape_1",
"/0/auto_model/layers.21/input_layernorm/Mul_1",
"/0/auto_model/layers.5/self_attn/q_norm/Pow",
"/0/auto_model/layers.22/self_attn/q_norm/ReduceMean",
"/0/auto_model/layers.22/self_attn/q_norm/Add",
"/0/auto_model/layers.22/mlp/down_proj/MatMul",
"/0/auto_model/layers.23/self_attn/k_norm/ReduceMean",
"/0/auto_model/layers.23/self_attn/k_norm/Add",
"/0/auto_model/layers.23/mlp/down_proj/MatMul",
"/0/auto_model/layers.26/mlp/down_proj/MatMul",
"/0/auto_model/layers.1/self_attn/Add_2",
"/0/auto_model/layers.2/self_attn/Add_2",
"/0/auto_model/layers.6/self_attn/Add_2",
"/0/auto_model/layers.11/self_attn/Add_2",
"/0/auto_model/layers.12/self_attn/Add_2",
"/0/auto_model/layers.16/self_attn/Add_2",
"/0/auto_model/layers.21/self_attn/Add_2",
"/0/auto_model/layers.24/self_attn/Add_2",
"/0/auto_model/layers.0/self_attn/Add_2",
"/0/auto_model/layers.8/self_attn/Add_2",
"/0/auto_model/layers.13/self_attn/Add_2",
"/0/auto_model/layers.26/self_attn/Add_2",
"/0/auto_model/layers.3/self_attn/Add_2",
"/0/auto_model/layers.15/self_attn/Add_2",
"/0/auto_model/layers.25/self_attn/Add_2",
"/0/auto_model/layers.4/self_attn/Add_2",
"/0/auto_model/layers.14/self_attn/Add_2",
"/0/auto_model/layers.22/self_attn/Add_2",
"/0/auto_model/layers.9/self_attn/Add_2",
"/0/auto_model/layers.23/self_attn/Add_2",
"/0/auto_model/layers.10/self_attn/Add_2",
"/0/auto_model/layers.5/self_attn/Add_2",
"/0/auto_model/layers.19/self_attn/Add_2",
"/0/auto_model/layers.7/self_attn/Add_2",
"/0/auto_model/layers.27/self_attn/Add_2",
"/0/auto_model/layers.18/self_attn/Add_2",
"/0/auto_model/layers.20/self_attn/Add_2",
"/0/auto_model/layers.17/self_attn/Add_2",
"/0/auto_model/Slice_1",
"/0/auto_model/layers.5/self_attn/Slice_4",
"/0/auto_model/layers.12/self_attn/Slice_4",
"/0/auto_model/layers.18/self_attn/Slice_4",
"/0/auto_model/layers.3/self_attn/Slice_4",
"/0/auto_model/layers.11/self_attn/Slice_4",
"/0/auto_model/layers.22/self_attn/Slice_4",
"/0/auto_model/Expand",
"/0/auto_model/layers.4/self_attn/Slice_4",
"/0/auto_model/Slice_2",
"/0/auto_model/layers.8/self_attn/Slice_4",
"/0/auto_model/layers.2/self_attn/Slice_4",
"/0/auto_model/layers.15/self_attn/Slice_4",
"/0/auto_model/layers.26/self_attn/Slice_4",
"/0/auto_model/layers.24/self_attn/Slice_4",
"/0/auto_model/Expand_1",
"/0/auto_model/layers.14/self_attn/Slice_4",
"/0/auto_model/layers.21/self_attn/Slice_4",
"/0/auto_model/layers.1/self_attn/Slice_4",
"/0/auto_model/Reshape_2",
"/0/auto_model/layers.19/self_attn/Slice_4",
"/0/auto_model/Slice",
"/0/auto_model/layers.6/self_attn/Slice_4",
"/0/auto_model/layers.0/self_attn/Slice_4",
"/0/auto_model/layers.25/self_attn/Slice_4",
"/0/auto_model/Unsqueeze_4",
"/0/auto_model/layers.10/self_attn/Slice_4",
"/0/auto_model/layers.23/self_attn/Slice_4",
"/0/auto_model/layers.17/self_attn/Slice_4",
"/0/auto_model/Where_1",
"/0/auto_model/layers.27/self_attn/Slice_4",
"/0/auto_model/layers.20/self_attn/Slice_4",
"/0/auto_model/Add",
"/0/auto_model/Mul",
"/0/auto_model/layers.7/self_attn/Slice_4",
"/0/auto_model/layers.13/self_attn/Slice_4",
"/0/auto_model/layers.9/self_attn/Slice_4",
"/0/auto_model/layers.16/self_attn/Slice_4",
"/0/auto_model/Unsqueeze_3",
"/0/auto_model/ScatterND"]

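The list above is long, so a quick way to see what kinds of nodes it covers is to group names by their last path segment. The sketch below does this for a handful of names copied from the list. It assumes the last segment is the op type, optionally followed by a numeric suffix such as `_1`, which matches the names shown here but is not guaranteed for every exported graph. The helper name `op_kind` is hypothetical.

```python
from collections import Counter

# A small sample copied from the exclusion list (not the full list).
sample_nodes = [
    "/0/auto_model/layers.25/post_attention_layernorm/Pow",
    "/0/auto_model/layers.26/input_layernorm/ReduceMean",
    "/0/auto_model/layers.27/self_attn/q_norm/Pow",
    "/0/auto_model/layers.2/mlp/down_proj/MatMul",
    "/0/auto_model/layers.3/Add_1",
    "/0/auto_model/layers.0/self_attn/Slice_4",
    "/0/auto_model/ScatterND",
]


def op_kind(node_name: str) -> str:
    """Return the last path segment with any trailing '_<number>' removed.

    Assumption: the last segment names the op type (e.g. 'Add_1' -> 'Add').
    """
    last = node_name.rsplit("/", 1)[-1]
    base, sep, suffix = last.rpartition("_")
    return base if sep and suffix.isdigit() else last


counts = Counter(op_kind(name) for name in sample_nodes)
for kind, n in counts.most_common():
    print(f"{kind}: {n}")
```

Running the same grouping over the full list should make it easier to see how much of it is normalization arithmetic (Pow, ReduceMean, Sqrt, Add) versus the few excluded MatMul nodes.
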
Benchmarks

Speed

Method: one big chunk of text, 10 runs

Seconds elapsed for dynamic_int4.onnx: 45.37 (this model)

Seconds elapsed for opt_f32.onnx: 46.07 (base f32 model preprocessed for quantization)

Seconds elapsed for dynamic_uint8.onnx: 34.61 (probably the one you want to use on CPU)

Verdict: This model kinda sucks on CPU. Let me know how it is on GPU please.
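
To put the timings above in relative terms, this short calculation divides the f32 baseline time by each model's time. The numbers are copied from this section; the variable names are arbitrary.

```python
# Timings in seconds, copied from the Speed section of this card.
timings = {
    "dynamic_int4.onnx": 45.37,
    "opt_f32.onnx": 46.07,
    "dynamic_uint8.onnx": 34.61,
}

baseline = timings["opt_f32.onnx"]
# A ratio above 1.0 means faster than the f32 baseline.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x the speed of the f32 baseline")
```

On these numbers the int4 model is only about 1.02x the baseline speed, while the uint8 model is about 1.33x.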

Accuracy

I used beir-qdrant with the scifact dataset.

The retrieval results here aren't the greatest.

I welcome additional benchmarks from the community; please feel free to share your results.

If someone wants to sponsor me with an NVIDIA GPU I can have a much faster turnaround time with my model experiments and explore some different quantization strategies.

onnx f32 model with f32 output (baseline):

ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301}
recall: {'Recall@1': 0.53828, 'Recall@3': 0.71517, 'Recall@5': 0.77883, 'Recall@10': 0.83056, 'Recall@100': 0.95333, 'Recall@1000': 0.99667}
precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113}

onnx dynamic int4/uint8 model with f32 output (this model's parent):

ndcg: {'NDCG@1': 0.55333, 'NDCG@3': 0.6491, 'NDCG@5': 0.6674, 'NDCG@10': 0.69277, 'NDCG@100': 0.7183, 'NDCG@1000': 0.72434}
recall: {'Recall@1': 0.52161, 'Recall@3': 0.71739, 'Recall@5': 0.7645, 'Recall@10': 0.83656, 'Recall@100': 0.95, 'Recall@1000': 0.99667}
precision: {'P@1': 0.55333, 'P@3': 0.26222, 'P@5': 0.17067, 'P@10': 0.095, 'P@100': 0.0108, 'P@1000': 0.00113}

onnx dynamic int4/uint8 model with uint8 output (this model):

ndcg: {'NDCG@1': 0.55333, 'NDCG@3': 0.64613, 'NDCG@5': 0.67406, 'NDCG@10': 0.68834, 'NDCG@100': 0.71482, 'NDCG@1000': 0.72134}
recall: {'Recall@1': 0.52161, 'Recall@3': 0.70961, 'Recall@5': 0.77828, 'Recall@10': 0.81822, 'Recall@100': 0.94333, 'Recall@1000': 0.99333}
precision: {'P@1': 0.55333, 'P@3': 0.25889, 'P@5': 0.17533, 'P@10': 0.09333, 'P@100': 0.01073, 'P@1000': 0.00112}
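
For a quick read of how much each quantization step costs, the sketch below computes the percent change in NDCG@10 and Recall@10 relative to the f32 baseline. The scores are copied from the results above; the choice of these two metrics and the helper name are just for illustration.

```python
# NDCG@10 and Recall@10 values copied from the results in this card.
results = {
    "f32 baseline": {"NDCG@10": 0.69999, "Recall@10": 0.83056},
    "int4/uint8, f32 output": {"NDCG@10": 0.69277, "Recall@10": 0.83656},
    "int4/uint8, uint8 output": {"NDCG@10": 0.68834, "Recall@10": 0.81822},
}

baseline = results["f32 baseline"]


def relative_change(value: float, reference: float) -> float:
    """Percent change of value relative to reference (negative means worse)."""
    return (value - reference) / reference * 100.0


changes = {
    name: {metric: relative_change(score, baseline[metric]) for metric, score in scores.items()}
    for name, scores in results.items()
}

for name, per_metric in changes.items():
    summary = ", ".join(f"{metric} {delta:+.2f}%" for metric, delta in per_metric.items())
    print(f"{name}: {summary}")
```

For this model the NDCG@10 drop works out to about 1.7% relative to the f32 baseline, in line with the rough 2% figure quoted at the top of this card.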

Example inference/benchmark code and how to use the model with Fastembed

After installing beir-qdrant, make sure to upgrade fastembed.

# pip install qdrant_client beir-qdrant
# pip install -U fastembed
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from qdrant_client import QdrantClient
from qdrant_client.models import Datatype
from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter
from beir_qdrant.retrieval.search.dense import DenseQdrantSearch

TextEmbedding.add_custom_model(
    model="electroglyph/Qwen3-Embedding-0.6B-onnx-int4",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-int4"),
    dim=1024,
    model_file="dynamic_int4.onnx",
)

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# IMPORTANT: USE THIS (OR A SIMILAR) QUERY FORMAT WITH THIS MODEL:
for k in queries.keys():
    queries[k] = (
        f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {queries[k]}"
    )

qdrant_client = QdrantClient("http://localhost:6333")

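model = DenseQdrantSearch(
    qdrant_client,
    # model_name must match the name registered with add_custom_model above
    model=DenseFastEmbedModelAdapter(model_name="electroglyph/Qwen3-Embedding-0.6B-onnx-int4"),
    collection_name="scifact-qwen3-int4",
    initialize=True,
    datatype=Datatype.UINT8,  # store the model's uint8 output vectors as uint8
)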

retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}")