Qwen3-Reranker-4B-GGUF

Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends.
Parameters ≈ 4 B • Context length 32K
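A minimal serving sketch follows. It assumes a recent llama.cpp build whose llama-server exposes a reranking endpoint (the --reranking flag and /v1/rerank route); flag and route names can vary between versions, and the query/document conventions expected by Qwen3-Reranker should be checked against the upstream model card.

# Serve one of the quantized files and enable the rerank endpoint
# (check `llama-server --help` for the exact flag name in your build).
./llama-server -m Qwen3-Reranker-4B-F16-Q4_K_M.gguf --reranking --port 8080 &

# Score candidate documents against a query; a higher relevance_score means a better match.
curl -s http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital and largest city of France.",
          "Mount Everest is the highest mountain on Earth."
        ]
      }'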

Files

| Filename | Precision | Size* | Est. quality Δ vs FP16 | Notes |
|---|---|---|---|---|
| Qwen3-Reranker-4B-F16.gguf | FP16 | 7.5 GB | 0 (reference) | Direct HF→GGUF |
| Qwen3-Reranker-4B-F16-Q8_0.gguf | Q8_0 | 4.3 GB | TBD | Near-lossless |
| Qwen3-Reranker-4B-F16-Q6_K.gguf | Q6_K | 3.5 GB | TBD | Size / quality trade-off |
| Qwen3-Reranker-4B-F16-Q5_K_M.gguf | Q5_K_M | 3.1 GB | TBD | Tight-memory recall |
| Qwen3-Reranker-4B-F16-Q4_K_M.gguf | Q4_K_M | 2.8 GB | TBD | Smallest; CPU-friendly |

*Sizes are rounded binary GiB.
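To pull a single quant rather than the whole repo, the huggingface_hub CLI can be used; the filename below is only an example, so pick whichever row of the table fits your memory budget.

# Download one quantized file from this repo (example: Q4_K_M).
pip install -U huggingface_hub
huggingface-cli download JonathanMiddleton/Qwen3-Reranker-4B-GGUF \
    Qwen3-Reranker-4B-F16-Q4_K_M.gguf --local-dir .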

Upstream Source

  • Repo: Qwen/Qwen3-Reranker-4B
  • Commit: f16fc5d (Jun 9, 2025)
  • License: Apache-2.0
  • Base model: Qwen/Qwen3-4B-Base (architecture: qwen3, 4.02 B params)
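For a reproducible conversion, the upstream checkpoint can be pinned to the commit above before running the steps below. This is a sketch using the huggingface_hub CLI; the local directory it creates is then passed to convert_hf_to_gguf.py.

# Snapshot the upstream repo at the pinned revision.
huggingface-cli download Qwen/Qwen3-Reranker-4B --revision f16fc5d \
    --local-dir Qwen/Qwen3-Reranker-4B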

Conversion & Quantization

# 1. Convert HF → GGUF (FP16)
python convert_hf_to_gguf.py Qwen/Qwen3-Reranker-4B \
       --outfile Qwen3-Reranker-4B-F16.gguf \
       --outtype f16

# 2. Quantize (keep token embeddings in FP16)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-4B-F16.gguf \
                 Qwen3-Reranker-4B-F16-${QT}.gguf \
                 $QT $(nproc)
done
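
An optional sanity check follows: it assumes the `gguf` Python package, which installs a gguf-dump script, and simply confirms from the GGUF tensor metadata that the token embeddings stayed at F16 as requested by --token-embedding-type above.

# Dump tensor metadata and check the token-embedding precision.
pip install -U gguf
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  echo "== ${QT} =="
  gguf-dump Qwen3-Reranker-4B-F16-${QT}.gguf | grep token_embd
done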