|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- Qwen/Qwen3-0.6B-Base |
|
tags: |
|
- transformers |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
--- |
|
|
|
# Qwen3-Embedding-0.6B-onnx-int4 |
|
|
|
This is an ONNX version of https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
|
|
|
This model has been dynamically quantized to int4/uint8 and further modified to output a 1024-dimensional uint8 tensor.
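Because the output is a uint8 tensor, any similarity you compute outside Qdrant should cast the bytes to float first to avoid integer overflow in the dot product. Here's a minimal sketch with NumPy (the vectors below are random stand-ins, not real embeddings):

```python
import numpy as np

def cosine_uint8(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two uint8 embedding vectors.

    Values are cast to float32 before the dot product to avoid
    integer overflow and to get a fractional result.
    """
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example with random "embeddings" (real ones come from the model)
rng = np.random.default_rng(0)
emb_a = rng.integers(0, 256, size=1024, dtype=np.uint8)
emb_b = rng.integers(0, 256, size=1024, dtype=np.uint8)
print(cosine_uint8(emb_a, emb_b))
```

If you store the vectors in Qdrant with `Datatype.UINT8` (as in the example code at the bottom), Qdrant handles this for you.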
|
|
|
You probably don't want to use this model on CPU. I've tested it on a Ryzen CPU with VNNI, and it runs at the same speed as the base f32 model, but with about 2% lower retrieval accuracy. I'm posting it here in case it's useful for GPU users. I'm not sure it actually is, but I already made it, so here it is.
|
|
|
This model is compatible with Qdrant's fastembed. Please note these details:

- Execute the model without pooling and without normalization

- Pay attention to the example query format in the code below
|
|
|
# Quantization method |
|
|
|
I did an int4 quantization pass with block size 128 (block size 32 was extremely close in accuracy), excluding the same nodes as in my uint8 model.
|
|
|
Then I quantized the remaining (non-excluded) nodes to uint8, using the same method as in https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8
|
|
|
<details> |
|
<summary>Here are the nodes I excluded</summary> |
|
|
|
```python |
|
["/0/auto_model/ConstantOfShape", |
|
"/0/auto_model/Constant_28", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.26/input_layernorm/Pow", |
|
"/0/auto_model/layers.25/input_layernorm/Pow", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.24/input_layernorm/Pow", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.23/input_layernorm/Pow", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.22/input_layernorm/Pow", |
|
"/0/auto_model/layers.3/input_layernorm/Pow", |
|
"/0/auto_model/layers.4/input_layernorm/Pow", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.5/input_layernorm/Pow", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.6/input_layernorm/Pow", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.7/input_layernorm/Pow", |
|
"/0/auto_model/layers.8/input_layernorm/Pow", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.9/input_layernorm/Pow", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.21/input_layernorm/Pow", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.10/input_layernorm/Pow", |
|
"/0/auto_model/layers.20/input_layernorm/Pow", |
|
"/0/auto_model/layers.11/input_layernorm/Pow", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.12/input_layernorm/Pow", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.13/input_layernorm/Pow", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.14/input_layernorm/Pow", |
|
"/0/auto_model/layers.19/input_layernorm/Pow", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.15/input_layernorm/Pow", |
|
"/0/auto_model/layers.16/input_layernorm/Pow", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.18/input_layernorm/Pow", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.17/input_layernorm/Pow", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.27/input_layernorm/Pow", |
|
"/0/auto_model/norm/Pow", |
|
"/0/auto_model/layers.25/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.26/input_layernorm/Add", |
|
"/0/auto_model/layers.26/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.25/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.25/input_layernorm/Add", |
|
"/0/auto_model/layers.24/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.24/input_layernorm/Add", |
|
"/0/auto_model/layers.24/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.23/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.23/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.23/input_layernorm/Add", |
|
"/0/auto_model/layers.22/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.26/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.22/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.22/input_layernorm/Add", |
|
"/0/auto_model/layers.3/input_layernorm/Add", |
|
"/0/auto_model/layers.3/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.21/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.4/input_layernorm/Add", |
|
"/0/auto_model/layers.4/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.3/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.5/input_layernorm/Add", |
|
"/0/auto_model/layers.5/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.4/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.5/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.6/input_layernorm/Add", |
|
"/0/auto_model/layers.6/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.6/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.7/input_layernorm/Add", |
|
"/0/auto_model/layers.7/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.8/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.8/input_layernorm/Add", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.7/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.9/input_layernorm/Add", |
|
"/0/auto_model/layers.9/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.8/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.21/input_layernorm/Add", |
|
"/0/auto_model/layers.21/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.20/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.9/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.10/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.10/input_layernorm/Add", |
|
"/0/auto_model/layers.20/input_layernorm/Add", |
|
"/0/auto_model/layers.20/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.11/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.11/input_layernorm/Add", |
|
"/0/auto_model/layers.10/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.12/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.12/input_layernorm/Add", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.11/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.12/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.13/input_layernorm/Add", |
|
"/0/auto_model/layers.13/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.19/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.13/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.14/input_layernorm/Add", |
|
"/0/auto_model/layers.14/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.19/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.19/input_layernorm/Add", |
|
"/0/auto_model/layers.18/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.14/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.15/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.15/input_layernorm/Add", |
|
"/0/auto_model/layers.16/input_layernorm/Add", |
|
"/0/auto_model/layers.16/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.15/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.18/input_layernorm/Add", |
|
"/0/auto_model/layers.18/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.17/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.17/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.17/input_layernorm/Add", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.16/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Add", |
|
"/0/auto_model/layers.27/post_attention_layernorm/ReduceMean", |
|
"/0/auto_model/layers.27/input_layernorm/Add", |
|
"/0/auto_model/layers.27/input_layernorm/ReduceMean", |
|
"/0/auto_model/layers.27/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.14/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.25/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.8/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.24/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.24/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.25/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.23/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.12/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.13/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.2/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.3/Add", |
|
"/0/auto_model/layers.3/Add_1", |
|
"/0/auto_model/layers.4/input_layernorm/Cast", |
|
"/0/auto_model/layers.3/input_layernorm/Cast", |
|
"/0/auto_model/layers.2/Add_1", |
|
"/0/auto_model/layers.4/Add", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.5/input_layernorm/Cast", |
|
"/0/auto_model/layers.4/Add_1", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.5/Add", |
|
"/0/auto_model/layers.5/Add_1", |
|
"/0/auto_model/layers.6/input_layernorm/Cast", |
|
"/0/auto_model/layers.7/Add_1", |
|
"/0/auto_model/layers.8/input_layernorm/Cast", |
|
"/0/auto_model/layers.7/Add", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.6/Add", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.6/Add_1", |
|
"/0/auto_model/layers.7/input_layernorm/Cast", |
|
"/0/auto_model/layers.8/Add", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.9/input_layernorm/Cast", |
|
"/0/auto_model/layers.8/Add_1", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.9/Add", |
|
"/0/auto_model/layers.9/Add_1", |
|
"/0/auto_model/layers.10/input_layernorm/Cast", |
|
"/0/auto_model/layers.11/input_layernorm/Cast", |
|
"/0/auto_model/layers.10/Add_1", |
|
"/0/auto_model/layers.10/Add", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.11/Add", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.11/Add_1", |
|
"/0/auto_model/layers.12/input_layernorm/Cast", |
|
"/0/auto_model/layers.12/Add", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.12/Add_1", |
|
"/0/auto_model/layers.13/input_layernorm/Cast", |
|
"/0/auto_model/layers.13/Add", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.14/input_layernorm/Cast", |
|
"/0/auto_model/layers.13/Add_1", |
|
"/0/auto_model/layers.14/Add_1", |
|
"/0/auto_model/layers.15/input_layernorm/Cast", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.14/Add", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.15/Add_1", |
|
"/0/auto_model/layers.16/input_layernorm/Cast", |
|
"/0/auto_model/layers.15/Add", |
|
"/0/auto_model/layers.17/input_layernorm/Cast", |
|
"/0/auto_model/layers.16/Add_1", |
|
"/0/auto_model/layers.16/Add", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.19/input_layernorm/Cast", |
|
"/0/auto_model/layers.18/Add_1", |
|
"/0/auto_model/layers.18/input_layernorm/Cast", |
|
"/0/auto_model/layers.17/Add_1", |
|
"/0/auto_model/layers.17/Add", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.18/Add", |
|
"/0/auto_model/layers.19/Add", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.22/Add_1", |
|
"/0/auto_model/layers.23/input_layernorm/Cast", |
|
"/0/auto_model/layers.20/Add_1", |
|
"/0/auto_model/layers.21/input_layernorm/Cast", |
|
"/0/auto_model/layers.21/Add_1", |
|
"/0/auto_model/layers.22/input_layernorm/Cast", |
|
"/0/auto_model/layers.19/Add_1", |
|
"/0/auto_model/layers.20/input_layernorm/Cast", |
|
"/0/auto_model/layers.24/input_layernorm/Cast", |
|
"/0/auto_model/layers.23/Add_1", |
|
"/0/auto_model/layers.22/Add", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.21/Add", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.20/Add", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.23/Add", |
|
"/0/auto_model/layers.25/input_layernorm/Cast", |
|
"/0/auto_model/layers.24/Add_1", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.24/Add", |
|
"/0/auto_model/layers.25/Add", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.25/Add_1", |
|
"/0/auto_model/layers.26/input_layernorm/Cast", |
|
"/0/auto_model/layers.26/Add", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Cast", |
|
"/0/auto_model/layers.21/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.26/Add_1", |
|
"/0/auto_model/layers.27/input_layernorm/Cast", |
|
"/0/auto_model/layers.27/Add", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Cast", |
|
"/0/auto_model/norm/Add", |
|
"/0/auto_model/norm/ReduceMean", |
|
"/0/auto_model/layers.23/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.21/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.22/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.10/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.19/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.2/mlp/Mul", |
|
"/0/auto_model/layers.22/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.11/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.20/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.20/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.18/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.17/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.19/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.27/Add_1", |
|
"/0/auto_model/norm/Cast", |
|
"/0/auto_model/layers.16/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.18/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.11/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.9/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.26/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.14/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.14/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.16/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/mlp/Mul", |
|
"/0/auto_model/layers.27/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.27/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.9/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.17/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.26/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.26/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.25/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.25/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.13/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.13/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.10/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.25/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.27/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.27/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.26/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.15/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.12/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.12/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.25/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.25/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.24/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.12/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.24/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.24/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.24/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.24/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.22/mlp/Mul", |
|
"/0/auto_model/layers.2/post_attention_layernorm/Pow", |
|
"/0/auto_model/layers.23/mlp/Mul", |
|
"/0/auto_model/layers.24/mlp/Mul", |
|
"/0/auto_model/layers.23/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.14/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.14/self_attn/k_proj/MatMul", |
|
"/0/auto_model/layers.14/self_attn/k_norm/Cast", |
|
"/0/auto_model/layers.14/self_attn/Reshape_1", |
|
"/0/auto_model/layers.21/mlp/Mul", |
|
"/0/auto_model/layers.3/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.3/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.4/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.5/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.4/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.5/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.6/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.6/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.8/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.8/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.7/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.7/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.9/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.10/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.9/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.11/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.10/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.12/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.11/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.12/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.13/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.14/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.13/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.15/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.14/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.16/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.15/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.17/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.16/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.19/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.17/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.18/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.18/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.19/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.23/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.20/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.21/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.22/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.22/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.24/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.20/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.21/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.23/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.25/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.24/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.25/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.26/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.26/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.15/self_attn/k_norm/Pow", |
|
"/0/auto_model/layers.27/input_layernorm/Sqrt", |
|
"/0/auto_model/layers.27/post_attention_layernorm/Sqrt", |
|
"/0/auto_model/layers.2/input_layernorm/Pow", |
|
"/0/auto_model/layers.26/mlp/Mul", |
|
"/0/auto_model/layers.23/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.23/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.13/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.21/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.21/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.6/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.27/self_attn/Reshape_7", |
|
"/0/auto_model/layers.27/self_attn/MatMul_1", |
|
"/0/auto_model/layers.27/self_attn/Transpose_4", |
|
"/0/auto_model/layers.26/self_attn/Expand_1", |
|
"/0/auto_model/layers.26/self_attn/Unsqueeze_19", |
|
"/0/auto_model/layers.26/self_attn/v_proj/MatMul", |
|
"/0/auto_model/layers.26/self_attn/Transpose_2", |
|
"/0/auto_model/layers.26/self_attn/Reshape_6", |
|
"/0/auto_model/layers.26/self_attn/Reshape_2", |
|
"/0/auto_model/layers.11/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.11/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.22/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.25/mlp/Mul", |
|
"/0/auto_model/layers.8/self_attn/k_norm/Cast", |
|
"/0/auto_model/layers.8/self_attn/k_proj/MatMul", |
|
"/0/auto_model/layers.8/self_attn/Reshape_1", |
|
"/0/auto_model/layers.21/input_layernorm/Mul_1", |
|
"/0/auto_model/layers.5/self_attn/q_norm/Pow", |
|
"/0/auto_model/layers.22/self_attn/q_norm/ReduceMean", |
|
"/0/auto_model/layers.22/self_attn/q_norm/Add", |
|
"/0/auto_model/layers.22/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.23/self_attn/k_norm/ReduceMean", |
|
"/0/auto_model/layers.23/self_attn/k_norm/Add", |
|
"/0/auto_model/layers.23/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.26/mlp/down_proj/MatMul", |
|
"/0/auto_model/layers.1/self_attn/Add_2", |
|
"/0/auto_model/layers.2/self_attn/Add_2", |
|
"/0/auto_model/layers.6/self_attn/Add_2", |
|
"/0/auto_model/layers.11/self_attn/Add_2", |
|
"/0/auto_model/layers.12/self_attn/Add_2", |
|
"/0/auto_model/layers.16/self_attn/Add_2", |
|
"/0/auto_model/layers.21/self_attn/Add_2", |
|
"/0/auto_model/layers.24/self_attn/Add_2", |
|
"/0/auto_model/layers.0/self_attn/Add_2", |
|
"/0/auto_model/layers.8/self_attn/Add_2", |
|
"/0/auto_model/layers.13/self_attn/Add_2", |
|
"/0/auto_model/layers.26/self_attn/Add_2", |
|
"/0/auto_model/layers.3/self_attn/Add_2", |
|
"/0/auto_model/layers.15/self_attn/Add_2", |
|
"/0/auto_model/layers.25/self_attn/Add_2", |
|
"/0/auto_model/layers.4/self_attn/Add_2", |
|
"/0/auto_model/layers.14/self_attn/Add_2", |
|
"/0/auto_model/layers.22/self_attn/Add_2", |
|
"/0/auto_model/layers.9/self_attn/Add_2", |
|
"/0/auto_model/layers.23/self_attn/Add_2", |
|
"/0/auto_model/layers.10/self_attn/Add_2", |
|
"/0/auto_model/layers.5/self_attn/Add_2", |
|
"/0/auto_model/layers.19/self_attn/Add_2", |
|
"/0/auto_model/layers.7/self_attn/Add_2", |
|
"/0/auto_model/layers.27/self_attn/Add_2", |
|
"/0/auto_model/layers.18/self_attn/Add_2", |
|
"/0/auto_model/layers.20/self_attn/Add_2", |
|
"/0/auto_model/layers.17/self_attn/Add_2", |
|
"/0/auto_model/Slice_1", |
|
"/0/auto_model/layers.5/self_attn/Slice_4", |
|
"/0/auto_model/layers.12/self_attn/Slice_4", |
|
"/0/auto_model/layers.18/self_attn/Slice_4", |
|
"/0/auto_model/layers.3/self_attn/Slice_4", |
|
"/0/auto_model/layers.11/self_attn/Slice_4", |
|
"/0/auto_model/layers.22/self_attn/Slice_4", |
|
"/0/auto_model/Expand", |
|
"/0/auto_model/layers.4/self_attn/Slice_4", |
|
"/0/auto_model/Slice_2", |
|
"/0/auto_model/layers.8/self_attn/Slice_4", |
|
"/0/auto_model/layers.2/self_attn/Slice_4", |
|
"/0/auto_model/layers.15/self_attn/Slice_4", |
|
"/0/auto_model/layers.26/self_attn/Slice_4", |
|
"/0/auto_model/layers.24/self_attn/Slice_4", |
|
"/0/auto_model/Expand_1", |
|
"/0/auto_model/layers.14/self_attn/Slice_4", |
|
"/0/auto_model/layers.21/self_attn/Slice_4", |
|
"/0/auto_model/layers.1/self_attn/Slice_4", |
|
"/0/auto_model/Reshape_2", |
|
"/0/auto_model/layers.19/self_attn/Slice_4", |
|
"/0/auto_model/Slice", |
|
"/0/auto_model/layers.6/self_attn/Slice_4", |
|
"/0/auto_model/layers.0/self_attn/Slice_4", |
|
"/0/auto_model/layers.25/self_attn/Slice_4", |
|
"/0/auto_model/Unsqueeze_4", |
|
"/0/auto_model/layers.10/self_attn/Slice_4", |
|
"/0/auto_model/layers.23/self_attn/Slice_4", |
|
"/0/auto_model/layers.17/self_attn/Slice_4", |
|
"/0/auto_model/Where_1", |
|
"/0/auto_model/layers.27/self_attn/Slice_4", |
|
"/0/auto_model/layers.20/self_attn/Slice_4", |
|
"/0/auto_model/Add", |
|
"/0/auto_model/Mul", |
|
"/0/auto_model/layers.7/self_attn/Slice_4", |
|
"/0/auto_model/layers.13/self_attn/Slice_4", |
|
"/0/auto_model/layers.9/self_attn/Slice_4", |
|
"/0/auto_model/layers.16/self_attn/Slice_4", |
|
"/0/auto_model/Unsqueeze_3", |
|
"/0/auto_model/ScatterND"] |
|
``` |
|
|
|
</details> |
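For reference, the two-pass recipe described above can be sketched with onnxruntime's quantization tooling. This is illustrative, not necessarily the exact script used: `EXCLUDED_NODES` stands in for the full node list above, and the file paths are placeholders.

```python
# Sketch of the two-pass quantization: int4 block quantization of MatMul
# weights, then dynamic uint8 for the remaining non-excluded nodes.
# EXCLUDED_NODES stands in for the full list in the <details> block above.
EXCLUDED_NODES = ["/0/auto_model/ConstantOfShape"]  # truncated for brevity

def quantize_int4_then_uint8(src: str, int4_path: str, final_path: str) -> None:
    # Imports are local so the sketch can be read without onnxruntime installed.
    import onnx
    from onnxruntime.quantization import QuantType, quantize_dynamic
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

    # Pass 1: int4 block quantization with block size 128, skipping the
    # accuracy-sensitive nodes.
    model = onnx.load(src)
    quant = MatMul4BitsQuantizer(
        model, block_size=128, is_symmetric=True, nodes_to_exclude=EXCLUDED_NODES
    )
    quant.process()
    quant.model.save_model_to_file(int4_path)

    # Pass 2: dynamic uint8 quantization of whatever remains, with the same
    # exclusion list.
    quantize_dynamic(
        int4_path, final_path, weight_type=QuantType.QUInt8,
        nodes_to_exclude=EXCLUDED_NODES,
    )
```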
|
|
|
# Benchmarks |
|
|
|
## Speed |
|
|
|
Method: embedding a big chunk of text, 10 runs per model.
|
|
|
- `dynamic_int4.onnx` (this model): 45.37 seconds

- `opt_f32.onnx` (base f32 model, preprocessed for quantization): 46.07 seconds

- `dynamic_uint8.onnx` (probably the one you want to use on CPU): 34.61 seconds
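The timing method is just wall-clock time over repeated embeds; a harness like this reproduces it (the `embed` callable here is a stand-in for a real fastembed/onnxruntime call):

```python
import time

def time_runs(embed, text: str, runs: int = 10) -> float:
    """Return total wall-clock seconds to embed `text` `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        embed(text)
    return time.perf_counter() - start

# Stand-in embed function for illustration; swap in a real call, e.g.
# lambda t: list(model.embed([t])) for a fastembed TextEmbedding.
elapsed = time_runs(lambda t: t.lower(), "big chunk of text " * 1000)
print(f"Seconds elapsed: {elapsed:.2f}")
```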
|
|
|
Verdict: this model kinda sucks on CPU. Please let me know how it performs on GPU.
|
|
|
## Accuracy |
|
|
|
I used beir-qdrant with the scifact dataset. |
|
|
|
The results on this retrieval benchmark aren't the greatest.

I welcome additional benchmarks from the community; please feel free to share any further results.
|
|
|
If someone wants to sponsor me with an NVIDIA GPU, I can have a much faster turnaround on my model experiments and explore some different quantization strategies.
|
|
|
|
|
onnx f32 model with f32 output (baseline): |
|
|
|
``` |
|
ndcg: {'NDCG@1': 0.57, 'NDCG@3': 0.65655, 'NDCG@5': 0.68177, 'NDCG@10': 0.69999, 'NDCG@100': 0.72749, 'NDCG@1000': 0.73301} |
|
recall: {'Recall@1': 0.53828, 'Recall@3': 0.71517, 'Recall@5': 0.77883, 'Recall@10': 0.83056, 'Recall@100': 0.95333, 'Recall@1000': 0.99667} |
|
precision: {'P@1': 0.57, 'P@3': 0.26111, 'P@5': 0.17467, 'P@10': 0.09467, 'P@100': 0.01083, 'P@1000': 0.00113} |
|
``` |
|
|
|
onnx dynamic int4/uint8 model with f32 output (this model's parent): |
|
|
|
``` |
|
ndcg: {'NDCG@1': 0.55333, 'NDCG@3': 0.6491, 'NDCG@5': 0.6674, 'NDCG@10': 0.69277, 'NDCG@100': 0.7183, 'NDCG@1000': 0.72434} |
|
recall: {'Recall@1': 0.52161, 'Recall@3': 0.71739, 'Recall@5': 0.7645, 'Recall@10': 0.83656, 'Recall@100': 0.95, 'Recall@1000': 0.99667} |
|
precision: {'P@1': 0.55333, 'P@3': 0.26222, 'P@5': 0.17067, 'P@10': 0.095, 'P@100': 0.0108, 'P@1000': 0.00113} |
|
``` |
|
|
|
onnx dynamic int4/uint8 model with uint8 output (this model): |
|
|
|
``` |
|
ndcg: {'NDCG@1': 0.55333, 'NDCG@3': 0.64613, 'NDCG@5': 0.67406, 'NDCG@10': 0.68834, 'NDCG@100': 0.71482, 'NDCG@1000': 0.72134} |
|
recall: {'Recall@1': 0.52161, 'Recall@3': 0.70961, 'Recall@5': 0.77828, 'Recall@10': 0.81822, 'Recall@100': 0.94333, 'Recall@1000': 0.99333} |
|
precision: {'P@1': 0.55333, 'P@3': 0.25889, 'P@5': 0.17533, 'P@10': 0.09333, 'P@100': 0.01073, 'P@1000': 0.00112} |
|
``` |
|
|
|
# Example inference/benchmark code showing how to use the model with fastembed
|
|
|
After installing beir-qdrant, make sure to upgrade fastembed.
|
|
|
```python |
|
# pip install qdrant_client beir-qdrant |
|
# pip install -U fastembed |
|
from fastembed import TextEmbedding |
|
from fastembed.common.model_description import PoolingType, ModelSource |
|
from beir import util |
|
from beir.datasets.data_loader import GenericDataLoader |
|
from beir.retrieval.evaluation import EvaluateRetrieval |
|
from qdrant_client import QdrantClient |
|
from qdrant_client.models import Datatype |
|
from beir_qdrant.retrieval.models.fastembed import DenseFastEmbedModelAdapter |
|
from beir_qdrant.retrieval.search.dense import DenseQdrantSearch |
|
|
|
TextEmbedding.add_custom_model( |
|
model="electroglyph/Qwen3-Embedding-0.6B-onnx-int4", |
|
pooling=PoolingType.DISABLED, |
|
normalization=False, |
|
sources=ModelSource(hf="electroglyph/Qwen3-Embedding-0.6B-onnx-int4"), |
|
dim=1024, |
|
model_file="dynamic_int4.onnx", |
|
) |
|
|
|
dataset = "scifact" |
|
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset) |
|
data_path = util.download_and_unzip(url, "datasets") |
|
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test") |
|
|
|
# IMPORTANT: USE THIS (OR A SIMILAR) QUERY FORMAT WITH THIS MODEL: |
|
for k in queries.keys(): |
|
queries[k] = ( |
|
f"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: {queries[k]}" |
|
) |
|
|
|
qdrant_client = QdrantClient("http://localhost:6333") |
|
|
|
model = DenseQdrantSearch(

    qdrant_client,

    # Use the same name the custom model was registered under above

    model=DenseFastEmbedModelAdapter(model_name="electroglyph/Qwen3-Embedding-0.6B-onnx-int4"),

    collection_name="scifact-qwen3-int4",

    initialize=True,

    datatype=Datatype.UINT8,

)
|
|
|
retriever = EvaluateRetrieval(model) |
|
results = retriever.retrieve(corpus, queries) |
|
|
|
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values) |
|
print(f"ndcg: {ndcg}\nrecall: {recall}\nprecision: {precision}") |
|
|
|
``` |