Qwen3-Embedding-8B-GGUF

Purpose

Multilingual text-embedding model in GGUF format for efficient CPU/GPU inference with llama.cpp and derivatives.
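
As a quick usage sketch, a single embedding can be produced with llama.cpp's llama-embedding tool. The Q8_0 filename, the last-token pooling flag, and the L2-normalisation setting below are illustrative assumptions, not part of this repo's tooling:

    # one-shot embedding with llama.cpp (Q8_0 file from the table below chosen as an example)
    llama-embedding -m Qwen3-Embedding-8B-Q8_0.gguf \
        -p "What is the capital of China?" \
        --pooling last \
        --embd-normalize 2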

Files

| Filename | Precision | Size* | Est. MTEB Δ vs FP16 | Notes |
|----------|-----------|-------|---------------------|-------|
| Qwen3-Embedding-8B-F16.gguf | FP16 | 15.1 GB | 0 | Direct conversion; reference quality |
| Qwen3-Embedding-8B-Q8_0.gguf | Q8_0 | 8.6 GB | ≈ +0.02 | Full-precision parity for most tasks |
| Qwen3-Embedding-8B-Q6_K.gguf | Q6_K | 6.9 GB | ≈ +0.20 | Balanced size / quality |
| Qwen3-Embedding-8B-Q5_K_M.gguf | Q5_K_M | 6.16 GB | ≈ +0.35 | Good recall under tight memory |
| Qwen3-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 5.41 GB | ≈ +0.60 | Lowest-size CPU-friendly build |
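
Individual files can be fetched with huggingface-cli; a minimal sketch, with the Q6_K filename picked arbitrarily from the table above:

    # download a single quant from this repo
    huggingface-cli download JonathanMiddleton/Qwen3-Embedding-8B-GGUF \
        Qwen3-Embedding-8B-Q6_K.gguf --local-dir .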

Upstream source

Qwen/Qwen3-Embedding-8B (https://huggingface.co/Qwen/Qwen3-Embedding-8B)
Conversion

  • Code base: llama.cpp commit a20f0a1 + PR #14029 (Qwen embedding support).
  • Command (FP16 conversion):
    python convert_hf_to_gguf.py Qwen/Qwen3-Embedding-8B \
          --outfile Qwen3-Embedding-8B-F16.gguf \
          --leave-output-tensor \
          --outtype f16
    
  • Quantisation (from the F16 file):
    # F16 GGUF produced by the conversion step above
    SRC="Qwen3-Embedding-8B-F16.gguf"
    BASE=$(basename "${SRC%.*}")
    DIR=$(dirname "$SRC")
    
    # keep token embeddings and the output tensor at F16 during quantisation
    EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
    
    for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
      OUT="${DIR}/${BASE}-${QT}.gguf"
      echo ">> quantising ${QT}  ->  $(basename "$OUT")"
      llama-quantize $EMB_OPT "$SRC" "$OUT" "$QT" $(nproc)
    done
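
Any of the resulting files can also be served over llama.cpp's OpenAI-compatible embeddings endpoint. The sketch below is an assumption-laden example: the Q6_K filename, port, and pooling flag are illustrative choices, not part of the conversion recipe above:

    # serve embeddings (Q6_K file chosen as an example)
    llama-server -m Qwen3-Embedding-8B-Q6_K.gguf --embeddings --pooling last --port 8080
    
    # query the OpenAI-compatible endpoint
    curl -s http://localhost:8080/v1/embeddings \
        -H "Content-Type: application/json" \
        -d '{"input": "The quick brown fox"}'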
    