|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- Qwen/Qwen3-0.6B-Base |
|
tags: |
|
- transformers |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
--- |
|
|
|
# Qwen3-Embedding-0.6B-onnx-int4 |
|
|
|
This is an ONNX version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
|
|
|
This model has been dynamically quantized to int4/uint8 and further modified to output a 1024-dimensional uint8 tensor.
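Because the output is a uint8 tensor, any similarity you compute outside Qdrant should cast the bytes to float first to avoid integer overflow in the dot product. Here's a minimal sketch with NumPy (the vectors below are random stand-ins, not real embeddings):

```python
import numpy as np

def cosine_uint8(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two uint8 embedding vectors.

    Values are cast to float32 before the dot product to avoid
    integer overflow and to get a fractional result.
    """
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example with random "embeddings" (real ones come from the model)
rng = np.random.default_rng(0)
emb_a = rng.integers(0, 256, size=1024, dtype=np.uint8)
emb_b = rng.integers(0, 256, size=1024, dtype=np.uint8)
print(cosine_uint8(emb_a, emb_b))
```

If you store the vectors in Qdrant with `Datatype.UINT8` (as in the example code at the bottom), Qdrant handles this for you.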
|
|
|
You probably don't want to use this model on CPU. I've tested it on a Ryzen CPU with VNNI, and it runs at the same speed as the base f32 model, but with about 2% lower retrieval accuracy. I'm posting it here in case it's useful for GPU users. I'm not sure it actually is, but I already made it, so here it is.
|
|
|
This model is compatible with Qdrant's fastembed. Please note these details:

- Execute the model without pooling and without normalization

- Pay attention to the example query format in the code below
|
|
|
# Quantization method |
|
|
|
I did an int4 quantization pass with block size 128 (block size 32 was extremely close in accuracy), excluding the same nodes as in my uint8 model.
|
|
|
Then I quantized the remaining (non-excluded) nodes to uint8, using the same method as in https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8
|
|
|
<details> |
|
<summary>Here are the nodes I excluded</summary> |
|
|
|
```python |
|
["/0/auto_model/ConstantOfShape", |
|
"/0/auto_model/Constant_28", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.26/input_layernorm/Pow", |
|
"/0/auto_model/layers.25/input_layernorm/Pow", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.24/input_layernorm/Pow", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.23/input_layernorm/Pow", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.22/input_layernorm/Pow", |
|
"/0/auto_model/layers.3/input_layernorm/Pow", |
|
"/0/auto_model/layers.4/input_layernorm/Pow", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.5/input_layernorm/Pow", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.6/input_layernorm/Pow", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.7/input_layernorm/Pow", |
|
"/0/auto_model/layers.8/input_layernorm/Pow", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.9/input_layernorm/Pow", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.21/input_layernorm/Pow", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.10/input_layernorm/Pow", |
|
"/0/auto_model/layers.20/input_layernorm/Pow", |
|
"/0/auto_model/layers.11/input_layernorm/Pow", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.12/input_layernorm/Pow", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.13/input_layernorm/Pow", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.14/input_layernorm/Pow", |
|
"/0/auto_model/layers.19/input_layernorm/Pow", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.15/input_layernorm/Pow", |
|
"/0/auto_model/layers.16/input_layernorm/Pow", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.18/input_layernorm/Pow", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.17/input_layernorm/Pow", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.27/input_layernorm/Pow", |
|
"/0/auto_model/norm/Pow", |
|
"/0/auto_model/layers.25/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.26/input_layernorm/Add", |
|
"/0/auto_model/layers.26/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.25/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.25/input_layernorm/Add", |
|
"/0/auto_model/layers.24/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.24/input_layernorm/Add", |
|
"/0/auto_model/layers.24/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.23/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.23/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.23/input_layernorm/Add", |
|
"/0/auto_model/layers.22/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.26/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.22/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.22/input_layernorm/Add", |
|
"/0/auto_model/layers.3/input_layernorm/Add", |
|
"/0/auto_model/layers.3/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.21/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.4/input_layernorm/Add", |
|
"/0/auto_model/layers.4/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.3/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.5/input_layernorm/Add", |
|
"/0/auto_model/layers.5/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.4/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.5/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.6/input_layernorm/Add", |
|
"/0/auto_model/layers.6/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.6/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.7/input_layernorm/Add", |
|
"/0/auto_model/layers.7/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.8/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.8/input_layernorm/Add", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.7/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.9/input_layernorm/Add", |
|
"/0/auto_model/layers.9/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.8/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.21/input_layernorm/Add", |
|
"/0/auto_model/layers.21/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.20/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.9/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.10/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.10/input_layernorm/Add", |
|
"/0/auto_model/layers.20/input_layernorm/Add", |
|
"/0/auto_model/layers.20/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.11/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.11/input_layernorm/Add", |
|
"/0/auto_model/layers.10/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.12/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.12/input_layernorm/Add", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.11/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.12/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.13/input_layernorm/Add", |
|
"/0/auto_model/layers.13/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.19/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.13/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.14/input_layernorm/Add", |
|
"/0/auto_model/layers.14/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.19/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.19/input_layernorm/Add", |
|
"/0/auto_model/layers.18/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.14/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.15/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.15/input_layernorm/Add", |
|
"/0/auto_model/layers.16/input_layernorm/Add", |
|
"/0/auto_model/layers.16/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.15/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.18/input_layernorm/Add", |
|
"/0/auto_model/layers.18/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.17/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.17/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.17/input_layernorm/Add", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.16/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.27/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.27/input_layernorm/Add", |
|
"/0/auto_model/layers.27/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.27/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.14/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.25/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.8/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.24/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.24/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.25/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.23/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.12/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.13/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.2/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.3/Add", |
|
"/0/auto_model/layers.3/Add_1", |
|
"/0/auto_model/layers.4/input_layernorm/Cast", |
|
"/0/auto_model/layers.3/input_layernorm/Cast", |
|
"/0/auto_model/layers.2/Add_1", |
|
"/0/auto_model/layers.4/Add", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.5/input_layernorm/Cast", |
|
"/0/auto_model/layers.4/Add_1", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.5/Add", |
|
"/0/auto_model/layers.5/Add_1", |
|
"/0/auto_model/layers.6/input_layernorm/Cast", |
|
"/0/auto_model/layers.7/Add_1", |
|
"/0/auto_model/layers.8/input_layernorm/Cast", |
|
"/0/auto_model/layers.7/Add", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.6/Add", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.6/Add_1", |
|
"/0/auto_model/layers.7/input_layernorm/Cast", |
|
"/0/auto_model/layers.8/Add", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.9/input_layernorm/Cast", |
|
"/0/auto_model/layers.8/Add_1", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.9/Add", |
|
"/0/auto_model/layers.9/Add_1", |
|
"/0/auto_model/layers.10/input_layernorm/Cast", |
|
"/0/auto_model/layers.11/input_layernorm/Cast", |
|
"/0/auto_model/layers.10/Add_1", |
|
"/0/auto_model/layers.10/Add", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.11/Add", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.11/Add_1", |
|
"/0/auto_model/layers.12/input_layernorm/Cast", |
|
"/0/auto_model/layers.12/Add", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.12/Add_1", |
|
"/0/auto_model/layers.13/input_layernorm/Cast", |
|
"/0/auto_model/layers.13/Add", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.14/input_layernorm/Cast", |
|
"/0/auto_model/layers.13/Add_1", |
|
"/0/auto_model/layers.14/Add_1", |
|
"/0/auto_model/layers.15/input_layernorm/Cast", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.14/Add", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.15/Add_1", |
|
"/0/auto_model/layers.16/input_layernorm/Cast", |
|
"/0/auto_model/layers.15/Add", |
|
"/0/auto_model/layers.17/input_layernorm/Cast", |
|
"/0/auto_model/layers.16/Add_1", |
|
"/0/auto_model/layers.16/Add", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.19/input_layernorm/Cast", |
|
"/0/auto_model/layers.18/Add_1", |
|
"/0/auto_model/layers.18/input_layernorm/Cast", |
|
"/0/auto_model/layers.17/Add_1", |
|
"/0/auto_model/layers.17/Add", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.18/Add", |
|
"/0/auto_model/layers.19/Add", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.22/Add_1", |
|
"/0/auto_model/layers.23/input_layernorm/Cast", |
|
"/0/auto_model/layers.20/Add_1", |
|
"/0/auto_model/layers.21/input_layernorm/Cast", |
|
"/0/auto_model/layers.21/Add_1", |
|
"/0/auto_model/layers.22/input_layernorm/Cast", |
|
"/0/auto_model/layers.19/Add_1", |
|
"/0/auto_model/layers.20/input_layernorm/Cast", |
|
"/0/auto_model/layers.24/input_layernorm/Cast", |
|
"/0/auto_model/layers.23/Add_1", |
|
"/0/auto_model/layers.22/Add", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.21/Add", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.20/Add", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.23/Add", |
|
"/0/auto_model/layers.25/input_layernorm/Cast", |
|
"/0/auto_model/layers.24/Add_1", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.24/Add", |
|
"/0/auto_model/layers.25/Add", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.25/Add_1", |
|
"/0/auto_model/layers.26/input_layernorm/Cast", |
|
"/0/auto_model/layers.26/Add", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.21/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.26/Add_1", |
|
"/0/auto_model/layers.27/input_layernorm/Cast", |
|
"/0/auto_model/layers.27/Add", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Cast", |
|
"/0/auto_model/norm/Add", |
|
"/0/auto_model/norm/ReduceMean", |
|
"/0/auto_model/layers.23/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.21/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.22/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.10/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.19/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.2/mlp/Mul", |
|
"/0/auto_model/layers.22/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.11/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.20/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.20/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.18/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.17/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.19/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.27/Add_1", |
|
"/0/auto_model/norm/Cast", |
|
"/0/auto_model/layers.16/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.18/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.11/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.9/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.26/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.14/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.14/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.16/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/mlp/Mul", |
|
"/0/auto_model/layers.27/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.27/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.9/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.17/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.26/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.25/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.25/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.13/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.13/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.10/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.25/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.27/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.27/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.26/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.15/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.12/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.12/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.25/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.25/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.24/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.12/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.24/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.24/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.24/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.24/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.22/mlp/Mul", |
|
"/0/auto_model/layers.2/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.23/mlp/Mul", |
|
"/0/auto_model/layers.24/mlp/Mul", |
|
"/0/auto_model/layers.23/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.14/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.14/self_attn/k_proj/MatMul", |
|
"/0/auto_model/layers.14/self_attn/k_norm/Cast", |
|
"/0/auto_model/layers.14/self_attn/Reshape_1", |
|
"/0/auto_model/layers.21/mlp/Mul", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.3/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.4/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.5/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.6/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.8/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.7/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.9/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.10/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.11/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.12/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.13/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.14/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.15/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.16/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.17/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.19/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.18/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.23/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.20/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.21/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.22/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.24/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.25/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.26/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.15/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.27/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.2/input_layernorm/Pow", |
|
"/0/auto_model/layers.26/mlp/Mul", |
|
"/0/auto_model/layers.23/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.23/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.13/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.21/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.21/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.6/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/self_attn/Reshape_7", |
|
"/0/auto_model/layers.27/self_attn/MatMul_1", |
|
"/0/auto_model/layers.27/self_attn/Transpose_4", |
|
"/0/auto_model/layers.26/self_attn/Expand_1", |
|
"/0/auto_model/layers.26/self_attn/Unsqueeze_19", |
|
"/0/auto_model/layers.26/self_attn/v_proj/MatMul", |
|
"/0/auto_model/layers.26/self_attn/Transpose_2", |
|
"/0/auto_model/layers.26/self_attn/Reshape_6", |
|
"/0/auto_model/layers.26/self_attn/Reshape_2", |
|
"/0/auto_model/layers.11/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.11/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.22/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.25/mlp/Mul", |
|
"/0/auto_model/layers.8/self_attn/k_norm/Cast", |
|
"/0/auto_model/layers.8/self_attn/k_proj/MatMul", |
|
"/0/auto_model/layers.8/self_attn/Reshape_1", |
|
"/0/auto_model/layers.21/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.5/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.22/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.22/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.22/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.23/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.23/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.23/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.26/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.1/self_attn/Add_2", |
|
"/0/auto_model/layers.2/self_attn/Add_2", |
|
"/0/auto_model/layers.6/self_attn/Add_2", |
|
"/0/auto_model/layers.11/self_attn/Add_2", |
|
"/0/auto_model/layers.12/self_attn/Add_2", |
|
"/0/auto_model/layers.16/self_attn/Add_2", |
|
"/0/auto_model/layers.21/self_attn/Add_2", |
|
"/0/auto_model/layers.24/self_attn/Add_2", |
|
"/0/auto_model/layers.0/self_attn/Add_2", |
|
"/0/auto_model/layers.8/self_attn/Add_2", |
|
"/0/auto_model/layers.13/self_attn/Add_2", |
|
"/0/auto_model/layers.26/self_attn/Add_2", |
|
"/0/auto_model/layers.3/self_attn/Add_2", |
|
"/0/auto_model/layers.15/self_attn/Add_2", |
|
"/0/auto_model/layers.25/self_attn/Add_2", |
|
"/0/auto_model/layers.4/self_attn/Add_2", |
|
"/0/auto_model/layers.14/self_attn/Add_2", |
|
"/0/auto_model/layers.22/self_attn/Add_2", |
|
"/0/auto_model/layers.9/self_attn/Add_2", |
|
"/0/auto_model/layers.23/self_attn/Add_2", |
|
"/0/auto_model/layers.10/self_attn/Add_2", |
|
"/0/auto_model/layers.5/self_attn/Add_2", |
|
"/0/auto_model/layers.19/self_attn/Add_2", |
|
"/0/auto_model/layers.7/self_attn/Add_2", |
|
"/0/auto_model/layers.27/self_attn/Add_2", |
|
"/0/auto_model/layers.18/self_attn/Add_2", |
|
"/0/auto_model/layers.20/self_attn/Add_2", |
|
"/0/auto_model/layers.17/self_attn/Add_2", |
|
"/0/auto_model/Slice_1", |
|
"/0/auto_model/layers.5/self_attn/Slice_4", |
|
"/0/auto_model/layers.12/self_attn/Slice_4", |
|
"/0/auto_model/layers.18/self_attn/Slice_4", |
|
"/0/auto_model/layers.3/self_attn/Slice_4", |
|
"/0/auto_model/layers.11/self_attn/Slice_4", |
|
"/0/auto_model/layers.22/self_attn/Slice_4", |
|
"/0/auto_model/Expand", |
|
"/0/auto_model/layers.4/self_attn/Slice_4", |
|
"/0/auto_model/Slice_2", |
|
"/0/auto_model/layers.8/self_attn/Slice_4", |
|
"/0/auto_model/layers.2/self_attn/Slice_4", |
|
"/0/auto_model/layers.15/self_attn/Slice_4", |
|
"/0/auto_model/layers.26/self_attn/Slice_4", |
|
"/0/auto_model/layers.24/self_attn/Slice_4", |
|
"/0/auto_model/Expand_1", |
|
"/0/auto_model/layers.14/self_attn/Slice_4", |
|
"/0/auto_model/layers.21/self_attn/Slice_4", |
|
"/0/auto_model/layers.1/self_attn/Slice_4", |
|
"/0/auto_model/Reshape_2", |
|
"/0/auto_model/layers.19/self_attn/Slice_4", |
|
"/0/auto_model/Slice", |
|
"/0/auto_model/layers.6/self_attn/Slice_4", |
|
"/0/auto_model/layers.0/self_attn/Slice_4", |
|
"/0/auto_model/layers.25/self_attn/Slice_4", |
|
"/0/auto_model/Unsqueeze_4", |
|
"/0/auto_model/layers.10/self_attn/Slice_4", |
|
"/0/auto_model/layers.23/self_attn/Slice_4", |
|
"/0/auto_model/layers.17/self_attn/Slice_4", |
|
"/0/auto_model/Where_1", |
|
"/0/auto_model/layers.27/self_attn/Slice_4", |
|
"/0/auto_model/layers.20/self_attn/Slice_4", |
|
"/0/auto_model/Add", |
|
"/0/auto_model/Mul", |
|
"/0/auto_model/layers.7/self_attn/Slice_4", |
|
"/0/auto_model/layers.13/self_attn/Slice_4", |
|
"/0/auto_model/layers.9/self_attn/Slice_4", |
|
"/0/auto_model/layers.16/self_attn/Slice_4", |
|
"/0/auto_model/Unsqueeze_3", |
|
"/0/auto_model/ScatterND"] |
|
``` |
|
|
|
</details> |
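For reference, the two-pass recipe described above can be sketched with onnxruntime's quantization tooling. This is illustrative, not necessarily the exact script used: `EXCLUDED_NODES` stands in for the full node list above, and the file paths are placeholders.

```python
# Sketch of the two-pass quantization: int4 block quantization of MatMul
# weights, then dynamic uint8 for the remaining non-excluded nodes.
# EXCLUDED_NODES stands in for the full list in the <details> block above.
EXCLUDED_NODES = ["/0/auto_model/ConstantOfShape"]  # truncated for brevity

def quantize_int4_then_uint8(src: str, int4_path: str, final_path: str) -> None:
    # Imports are local so the sketch can be read without onnxruntime installed.
    import onnx
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

    # Pass 1: int4 block quantization with block size 128, skipping the
    # accuracy-sensitive nodes.
    model = onnx.load(src)
    quant = MatMul4BitsQuantizer(
        model, block_size=128, is_symmetric=True, nodes_to_exclude=EXCLUDED_NODES
    )
    quant.process()
    quant.model.save_model_to_file(int4_path)

    # Pass 2: dynamic uint8 quantization of whatever remains, with the same
    # exclusion list.
    quantize_dynamic(
        int4_path, final_path, weight_type=QuantType.QUInt8,
        nodes_to_exclude=EXCLUDED_NODES,
    )
```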
|
|
|
# Benchmarks |
|
|
|
## Speed |
|
|
|
Method: embedding a big chunk of text, 10 runs per model.
|
|
|
- `dynamic_int4.onnx` (this model): 45.37 seconds

- `opt_f32.onnx` (base f32 model, preprocessed for quantization): 46.07 seconds

- `dynamic_uint8.onnx` (probably the one you want to use on CPU): 34.61 seconds
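The timing method is just wall-clock time over repeated embeds; a harness like this reproduces it (the `embed` callable here is a stand-in for a real fastembed/onnxruntime call):

```python
import time

def time_runs(embed, text: str, runs: int = 10) -> float:
    """Return total wall-clock seconds to embed `text` `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        embed(text)
    return time.perf_counter() - start

# Stand-in embed function for illustration; swap in a real call, e.g.
# lambda t: list(model.embed([t])) for a fastembed TextEmbedding.
elapsed = time_runs(lambda t: t.lower(), "big chunk of text " * 1000)
print(f"Seconds elapsed: {elapsed:.2f}")
```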
|
|
|
Verdict: this model kinda sucks on CPU. Please let me know how it performs on GPU.
|
|
|
## Accuracy |
|
|
|
I used beir-qdrant with the scifact dataset. |
|
|
|
The results on this retrieval benchmark aren't the greatest.

I welcome additional benchmarks from the community; please feel free to share any further results.
|
|
|
If someone wants to sponsor me with an NVIDIA GPU, I can have a much faster turnaround on my model experiments and explore some different quantization strategies.
|
|
|
|
|
onnx f32 model with f32 output (baseline): |
|
|
|
``` |
|
ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301} |
|
recall: {'Recall@1': 0.53828, 'Recall@3': 0.71517, 'Recall@5': 0.77883, 'Recall@10': 0.83056, 'Recall@100': 0.95333, 'Recall@1000': 0.99667} |
|
precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113} |
|
``` |
|
|
|
onnx dynamic int4/uint8 model with f32 output (this model's parent): |
|
|
|
``` |
|
ndcg: {'NDCG@1': 0.55333, 'NDCG@3': 0.6491, 'NDCG@5': 0.6674, 'NDCG@10': 0.69277, 'NDCG@100': 0.7183, 'NDCG@1000': 0.72434} |
|
recall: {'Recall@1': 0.52161, 'Recall@3': 0.71739, 'Recall@5': 0.7645, 'Recall@10': 0.83656, 'Recall@100': 0.95, 'Recall@1000': 0.99667} |
|
precision: {'P@1': 0.55333, 'P@3': 0.26222, 'P@5': 0.17067, 'P@10': 0.095, 'P@100': 0.0108, 'P@1000': 0.00113} |
|
``` |
|
|
|
onnx dynamic int4/uint8 model with uint8 output (this model): |
|
|
|
``` |
|
ndcg: {'NDCG@1': 0.55333, 'NDCG@3': 0.64613, 'NDCG@5': 0.67406, 'NDCG@10': 0.68834, 'NDCG@100': 0.71482, 'NDCG@1000': 0.72134} |
|
recall: {'Recall@1': 0.52161, 'Recall@3': 0.70961, 'Recall@5': 0.77828, 'Recall@10': 0.81822, 'Recall@100': 0.94333, 'Recall@1000': 0.99333} |
|
precision: {'P@1': 0.55333, 'P@3': 0.25889, 'P@5': 0.17533, 'P@10': 0.09333, 'P@100': 0.01073, 'P@1000': 0.00112} |
|
``` |
|
|
|
# Example inference/benchmark code showing how to use the model with fastembed
|
|
|
After installing beir-qdrant, make sure to upgrade fastembed.
|
|
|
```python |
|
# pip install qdrant_client beir-qdrant |
|
# pip install -U fastembed |
|
from fastembed import TextEmbedding |
|
from fastembed.common.model_description import PoolingType, ModelSource |
|
from beir import util |
|
from beir.datasets.data_loader import GenericDataLoader |
|
from beir.retrieval.evaluation import EvaluateRetrieval |
|
from qdrant_client import QdrantClient |
|
from qdrant_client.models import Datatype |
|
from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter |
|
from beir_qdrant.retrieval.search.dense import DenseQdrantSearch |
|
|
|
TextEmbedding.add_custom_model( |
|
model="electroglyph/Qwen3-Embedding-0.6B-onnx-int4", |
|
pooling=PoolingType.DISABLED, |
|
normalization=False, |
|
sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-int4"), |
|
dim=1024, |
|
model_file="dynamic_int4.onnx", |
|
) |
|
|
|
dataset = "scifact" |
|
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset) |
|
data_path = util.download_and_unzip(url, "datasets") |
|
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test") |
|
|
|
# IMPORTANT: USE THIS (OR A SIMILAR) QUERY FORMAT WITH THIS MODEL: |
|
for k in queries.keys(): |
|
queries[k] = ( |
|
f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {queries[k]}" |
|
) |
|
|
|
qdrant_client = QdrantClient("http://localhost:6333") |
|
|
|
model = DenseQdrantSearch(

    qdrant_client,

    # Use the same name the custom model was registered under above

    model=DenseFastEmbedModelAdapter(model_name="electroglyph/Qwen3-Embedding-0.6B-onnx-int4"),

    collection_name="scifact-qwen3-int4",

    initialize=True,

    datatype=Datatype.UINT8,

)
|
|
|
retriever = EvaluateRetrieval(model) |
|
results = retriever.retrieve(corpus, queries) |
|
|
|
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values) |
|
print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}") |
|
|
|
``` |