# Qwen3-Reranker-4B-GGUF

## Purpose
Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends.
Parameters ≈ 4 B • Context length 32K
## Files
| Filename | Precision | Size* | Est. quality Δ vs FP16 | Notes |
|---|---|---|---|---|
| Qwen3-Reranker-4B-F16.gguf | FP16 | 7.5 GB | 0 (reference) | Direct HF→GGUF |
| Qwen3-Reranker-4B-F16-Q8_0.gguf | Q8_0 | 4.3 GB | TBD | Near-lossless |
| Qwen3-Reranker-4B-F16-Q6_K.gguf | Q6_K | 3.5 GB | TBD | Size / quality trade-off |
| Qwen3-Reranker-4B-F16-Q5_K_M.gguf | Q5_K_M | 3.1 GB | TBD | Tight-memory recall |
| Qwen3-Reranker-4B-F16-Q4_K_M.gguf | Q4_K_M | 2.8 GB | TBD | Smallest; CPU-friendly |

\*Rounded binary GiB.
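A quick way to exercise any of these files is the reranking mode in recent llama.cpp builds (`llama-server --reranking`, which exposes a `/v1/rerank` endpoint). Treat the following as a minimal sketch rather than a verified reference usage: the flag, endpoint shape, port, and whether the server's rerank pooling matches this model's intended scoring scheme are assumptions, and the query/documents are placeholders.

```bash
# Serve the Q4_K_M file in reranking mode (assumes a llama.cpp build with --reranking).
llama-server -m Qwen3-Reranker-4B-F16-Q4_K_M.gguf --reranking --port 8080 &

# Score candidate documents against a query; a higher relevance score means a better match.
curl -s http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital and largest city of France.",
          "The Great Wall of China is thousands of kilometres long."
        ]
      }'
```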
## Upstream Source

- Repo: Qwen/Qwen3-Reranker-4B
- Commit: f16fc5d (Jun 9 2025)
- License: Apache-2.0
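`convert_hf_to_gguf.py` is normally pointed at a local checkout of the upstream repository, so downloading the pinned commit above is the usual first step. A minimal sketch using `huggingface-cli` from the `huggingface_hub` package; mirroring the repo id in `--local-dir` is an assumption chosen so the conversion command below can treat `Qwen/Qwen3-Reranker-4B` as a local path.

```bash
# Download the upstream weights at the pinned commit.
# --local-dir mirrors the repo id so the converter in the next section
# can consume "Qwen/Qwen3-Reranker-4B" as a local directory.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Reranker-4B \
    --revision f16fc5d \
    --local-dir Qwen/Qwen3-Reranker-4B
```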
Conversion & Quantization
# 1. Convert HF → GGUF (FP16)
python convert_hf_to_gguf.py Qwen/Qwen3-Reranker-4B \
--outfile Qwen3-Reranker-4B-F16.gguf \
--leave-output-tensor --outtype f16
# 2. Quantize (keep token embeddings in FP16)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
llama-quantize $EMB_OPT Qwen3-Reranker-4B-F16.gguf \
Qwen3-Reranker-4B-F16-${QT}.gguf \
$QT $(nproc)
done
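As a quick sanity check on the produced artifacts, the `gguf-dump` tool from the `gguf` Python package (an assumed part of the local tooling, installable with `pip install gguf`) can confirm that each file carries the expected architecture, context-length, and quantization metadata:

```bash
# Print the GGUF header/metadata of every produced file.
for F in Qwen3-Reranker-4B-F16*.gguf; do
    echo "== ${F} =="
    gguf-dump "${F}" | head -n 20
done
```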