---
license: apache-2.0
base_model: Qwen/Qwen3-32B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 4-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-32B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-32B

The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. These quants were specifically optimized for the Qwen3-32B model to give a size similar to IQ4_XS while meeting or exceeding the performance of the IQ4_XS quant, and they use K quants in all layers for faster CPU processing on partially offloaded models.

The layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
[8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
[16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
[24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
[32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q3_K_L"],[38,"Q4_K_S"],[39,"Q3_K_L"],
[40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_S"],[45,"Q4_K_S"],[46,"Q4_K_S"],[47,"Q4_K_S"],
[48,"Q4_K_M"],[49,"Q4_K_S"],[50,"Q4_K_M"],[51,"Q4_K_S"],[52,"Q4_K_M"],[53,"Q4_K_S"],[54,"Q4_K_M"],[55,"Q4_K_S"],
[56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q4_K_M"],[62,"Q4_K_M"],[63,"Q4_K_M"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
```
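
For reference, a quantization run with these settings might look like the sketch below. The two FLAGS are standard llama-quantize options; applying LAYER_TYPES on a per-layer basis is not supported by the stock tool and assumes a patched llama-quantize along the lines of the discussion linked at the bottom of this card. File names are placeholders.

```
# Sketch only: --token-embedding-type and --output-tensor-type are stock
# llama-quantize options. Consuming LAYER_TYPES per layer assumes a patched
# build; the final argument is the base quant type the CLI requires.
LAYER_TYPES="$LAYER_TYPES" ./llama-quantize $FLAGS \
    Qwen3-32B-BF16.gguf Qwen3-32B.Q4_K_H.gguf Q4_K_M
```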

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size/e9 B | PPL | Comment                  |
|--------|-----------|-----|--------------------------|
| IQ4_XS | 17.9      | 7.8 | default embed and output |
| Q4_K_H | 17.9      | 7.8 | Q4_K embed, Q6_K output  |
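
The PPL column contains llama.cpp perplexity values. A typical way to reproduce this kind of measurement is sketched below; the wikitext-2 test file is an assumption, since the exact corpus and context settings behind the table are not specified here.

```
# Hypothetical reproduction: wiki.test.raw is an assumed wikitext-2 raw
# test file; -ngl 99 offloads all layers to GPU if one is available.
./llama-perplexity -m Qwen3-32B.Q4_K_H.gguf -f wiki.test.raw -ngl 99
```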

Full evals for the Q4_K_H quant are available at https://huggingface.co/spaces/steampunque/benchlm

## Download the file below:

| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-32B.Q4_K_H.gguf](https://huggingface.co/steampunque/Qwen3-32B-Hybrid-GGUF/resolve/main/Qwen3-32B.Q4_K_H.gguf) | Q4_K_H | 17.9 | IQ4_XS+ quality |
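
As a quick start, the file can be fetched and run with llama.cpp as sketched below; the offload count and prompt are illustrative, so adjust -ngl to the available VRAM.

```
# Fetch the GGUF from this repo (huggingface-cli ships with the
# huggingface_hub Python package).
huggingface-cli download steampunque/Qwen3-32B-Hybrid-GGUF \
    Qwen3-32B.Q4_K_H.gguf --local-dir .

# Run partially offloaded: -ngl sets how many layers go to the GPU and the
# rest run on CPU, where the all-K-quant layout helps throughput.
./llama-cli -m Qwen3-32B.Q4_K_H.gguf -ngl 40 -c 8192 \
    -p "Explain hybrid layer quantization in one paragraph."
```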

A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040