---
license: apache-2.0
base_model: Qwen/Qwen3-4B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 8-bit
---
|
|
|
## Llama.cpp hybrid layer quantization of Qwen3-4B by Alibaba |
|
|
|
Original model: https://huggingface.co/Qwen/Qwen3-4B |
|
|
|
The hybrid quant employs different quantization levels on a per-layer basis to increase the
flexibility of trading off performance vs file size. Fewer parameter bits are used at the deep layers
and more bits at the cortex layers to simultaneously optimize quantized size and model performance.
These quants were specifically optimized for the Qwen3 4B edge model to give essentially no performance loss
vs the Q8_0 quant while reducing file size by about 0.6 GB.
|
|
|
The layer quants are as follows: |
|
```
LAYER_TYPES='[
[0 ,"Q8_0"  ],[1 ,"Q5_K_M"],[2 ,"Q5_K_M"],[3 ,"Q5_K_M"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
[6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8 ,"Q5_K_M"],[9 ,"Q5_K_M"],[10,"Q5_K_M"],[11,"Q5_K_M"],
[12,"Q6_K"  ],[13,"Q6_K"  ],[14,"Q6_K"  ],[15,"Q6_K"  ],[16,"Q6_K"  ],[17,"Q6_K"  ],
[18,"Q6_K"  ],[19,"Q6_K"  ],[20,"Q6_K"  ],[21,"Q6_K"  ],[22,"Q6_K"  ],[23,"Q6_K"  ],
[24,"Q8_0"  ],[25,"Q8_0"  ],[26,"Q8_0"  ],[27,"Q8_0"  ],[28,"Q8_0"  ],[29,"Q8_0"  ],
[30,"Q8_0"  ],[31,"Q8_0"  ],[32,"Q8_0"  ],[33,"Q8_0"  ],[34,"Q8_0"  ],[35,"Q8_0"  ]
]'
FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q6_K"
```
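
For reference, a minimal sketch of how these variables might be applied, assuming a llama-quantize build patched to honor the per-layer LAYER_TYPES override (stock llama.cpp only exposes the global overrides used in FLAGS; see the discussion link at the end of this card). The source file name is a placeholder:

```
# Sketch only: LAYER_TYPES is honored by the patched quantize tool discussed
# in the link at the end of this card, not by stock llama.cpp. The FLAGS
# options are standard llama-quantize flags. Qwen3-4B.BF16.gguf is a
# placeholder name for the unquantized source model.
llama-quantize $FLAGS Qwen3-4B.BF16.gguf Qwen3-4B.Q8_0_H.gguf Q8_0
```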
|
|
|
These quants were selected based on combined subjective and objective performance
evaluations to give both high performance and reduced file size.
|
|
|
Comparison: |
|
|
|
Quant | Size (bytes) | PPL | Comment
---------|--------------|------|-----------
Q8_0 | 4.3e9 | 13.2 | default embed and output
Q8_0_H | 3.6e9 | 13.1 | Q8_0 embed, Q6_K output
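
Perplexity figures like these can be reproduced with llama.cpp's llama-perplexity tool; a minimal sketch, assuming a local test corpus (the card does not state which dataset produced the numbers above):

```
# wiki.test.raw is a placeholder corpus file; substitute whatever text
# you want to evaluate against.
llama-perplexity -m Qwen3-4B.Q8_0_H.gguf -f wiki.test.raw
```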
|
|
|
Full evals comparing Qwen3 4B Q8_0 and Q8_0_H are also available at https://huggingface.co/spaces/steampunque/benchlm |
|
|
|
## Download the file below:
|
| Link | Type | Size (bytes) | Notes |
|------|------|--------------|-------|
| [Qwen3-4B.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf) | Q8_0_H | 3.6e9 | Q8_0 quality |
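
As a quick usage example, the file can be fetched and run with standard tooling; a sketch, assuming huggingface-cli and a llama.cpp build are on the PATH:

```
# Fetch the quant from this repo with the standard Hugging Face CLI
huggingface-cli download steampunque/Qwen3-4B-Hybrid-GGUF Qwen3-4B.Q8_0_H.gguf --local-dir .
# Smoke-test a short generation with llama.cpp
llama-cli -m Qwen3-4B.Q8_0_H.gguf -p "Hello, my name is" -n 64
```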
|
|
|
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:
|
|
|
https://github.com/ggml-org/llama.cpp/discussions/13040 |