---
license: apache-2.0
base_model: Qwen/Qwen3-32B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 4-bit
---
## Llama.cpp hybrid layer quantization of Qwen3-32B by Alibaba
Original model: https://huggingface.co/Qwen/Qwen3-32B
The hybrid quant employs different quantization levels on a per-layer basis to increase
flexibility in trading off performance against file size. Fewer parameter bits are used at deep layers
and more bits at cortex layers to simultaneously optimize quantized size and model performance.
These quants were specifically optimized for the Qwen3 32B model to give a size similar to IQ4_XS
while meeting or exceeding the performance of the IQ4_XS quant, using K quants in all layers
for faster CPU processing on partially offloaded models.
The layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
[8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
[16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
[24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
[32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q3_K_L"],[38,"Q4_K_S"],[39,"Q3_K_L"],
[40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_S"],[45,"Q4_K_S"],[46,"Q4_K_S"],[47,"Q4_K_S"],
[48,"Q4_K_M"],[49,"Q4_K_S"],[50,"Q4_K_M"],[51,"Q4_K_S"],[52,"Q4_K_M"],[53,"Q4_K_S"],[54,"Q4_K_M"],[55,"Q4_K_S"],
[56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q4_K_M"],[62,"Q4_K_M"],[63,"Q4_K_M"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
```
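For reference, a quantization run with these settings might look like the sketch below. Per-layer quant overrides are not in stock llama.cpp; they come from the patch described in the discussion linked at the end of this card, and LLAMA_LAYER_TYPES is a hypothetical stand-in for however the patched build is fed the layer table. The --token-embedding-type and --output-tensor-type flags are standard llama-quantize options.
```
# Illustrative sketch only: how the patched llama-quantize consumes the
# layer table is implementation-specific; LLAMA_LAYER_TYPES is hypothetical.
# The trailing Q4_K_M is assumed to act as the default type for any layer
# not covered by the table.
LLAMA_LAYER_TYPES="$LAYER_TYPES" \
  ./llama-quantize $FLAGS Qwen3-32B-BF16.gguf Qwen3-32B.Q4_K_H.gguf Q4_K_M
```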
These quants were selected based on combined subjective and objective performance
evaluations to give both high performance and reduced file size.
Comparison:
| Quant  | Size/e9 B | PPL | Comment                  |
|--------|-----------|-----|--------------------------|
| IQ4_XS | 17.9      | 7.8 | default embed and output |
| Q4_K_H | 17.9      | 7.8 | Q4_K embed, Q6_K output  |
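Perplexity can be measured with the stock llama-perplexity tool. A minimal sketch, assuming a WikiText-2 style raw text file (wiki.test.raw is a placeholder; the exact corpus behind the figures above is not specified here):
```
# Compute perplexity of the quantized model over a raw text file.
./llama-perplexity -m Qwen3-32B.Q4_K_H.gguf -f wiki.test.raw
```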
Full evals for the Q4_K_H quant are available at https://huggingface.co/spaces/steampunque/benchlm
## Download the file below:
| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-32B.Q4_K_H.gguf](https://huggingface.co/steampunque/Qwen3-32B-Hybrid-GGUF/resolve/main/Qwen3-32B.Q4_K_H.gguf) | Q4_K_H | 17.9 | IQ4_XS+ quality |
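Because all layers use K quants, the model performs well when only partially offloaded to GPU. A minimal llama.cpp run sketch; the layer count and context size are example values to tune for your VRAM:
```
# Offload 40 of the 64 layers to GPU; the rest run on CPU, where K quants
# dequantize quickly. -ngl, -c, and -p are standard llama-cli flags.
./llama-cli -m Qwen3-32B.Q4_K_H.gguf -ngl 40 -c 8192 -p "Hello"
```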
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:
https://github.com/ggml-org/llama.cpp/discussions/13040