# Qwen3-Embedding-8B-GGUF
## Purpose
Multilingual text-embedding model in GGUF format for efficient CPU/GPU inference with llama.cpp and derivatives.
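For example, one of the quantised files listed below can be served as an OpenAI-compatible embeddings endpoint with `llama-server`. The sketch below is illustrative only: the chosen file, port, and `--pooling last` setting are assumptions, not part of this release.

```bash
# Serve a quantised build with the embeddings endpoint enabled
# (file name, port and pooling mode are assumptions for illustration)
llama-server -m Qwen3-Embedding-8B-Q6_K.gguf \
  --embeddings --pooling last --port 8080

# Request an embedding via the OpenAI-compatible API
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is the capital of France?"}'
```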
## Files
| Filename | Precision | Size* | Est. MTEB Δ vs FP16 | Notes |
|---|---|---|---|---|
| Qwen3-Embedding-8B-F16.gguf | FP16 | 15.1 GB | 0 | Direct conversion; reference quality |
| Qwen3-Embedding-8B-Q8_0.gguf | Q8_0 | 8.6 GB | ≈ +0.02 | Full-precision parity for most tasks |
| Qwen3-Embedding-8B-Q6_K.gguf | Q6_K | 6.9 GB | ≈ +0.20 | Balanced size / quality |
| Qwen3-Embedding-8B-Q5_K_M.gguf | Q5_K_M | 6.16 GB | ≈ +0.35 | Good recall under tight memory |
| Qwen3-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 5.41 GB | ≈ +0.60 | Lowest-size CPU-friendly build |
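To pull a single file from the table above without cloning the whole repository, `huggingface-cli download` accepts a filename argument. The repository id below is a placeholder; substitute the actual path of this model card.

```bash
# Fetch one quantised file into the current directory
# (<your-org> is a placeholder, not the real repository owner)
huggingface-cli download <your-org>/Qwen3-Embedding-8B-GGUF \
  Qwen3-Embedding-8B-Q6_K.gguf --local-dir .
```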
## Upstream source
- Repository: Qwen/Qwen3-Embedding-8B
- Commit: `1d8ad4c` (2025-07-12)
- Licence: Apache-2.0
## Conversion
- Code base: llama.cpp commit `a20f0a1` + PR #14029 (Qwen embedding support).
- Commands:

```bash
# 1. Convert the HF checkpoint to an FP16 GGUF reference file
python convert_hf_to_gguf.py Qwen/Qwen3-Embedding-8B \
  --outfile Qwen3-Embedding-8B-F16.gguf \
  --leave-output-tensor \
  --outtype f16

# 2. Quantise the FP16 file into the released variants
SRC=Qwen3-Embedding-8B-F16.gguf   # assumed: the F16 file from step 1 (SRC was left undefined in the original snippet)
BASE=$(basename "${SRC%.*}")
DIR=$(dirname "$SRC")
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"

for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUT="${DIR}/${BASE}-${QT}.gguf"
  echo ">> quantising ${QT} -> $(basename "$OUT")"
  llama-quantize $EMB_OPT "$SRC" "$OUT" "$QT" "$(nproc)"
done
```
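A freshly quantised file can be spot-checked with `llama-embedding` before publishing. The invocation below is a sketch; the pooling mode, normalisation, and output-format flags are assumptions rather than part of the original conversion recipe.

```bash
# Quick sanity check: embed one sentence with a quantised file
# (pooling mode and output format are illustrative assumptions)
llama-embedding -m Qwen3-Embedding-8B-Q8_0.gguf \
  -p "GGUF quantisation sanity check" \
  --pooling last --embd-normalize 2 --embd-output-format json
```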