---
license: apache-2.0
base_model: Qwen/Qwen3-4B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 8-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-4B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-4B

The hybrid quant employs different quantization levels on a per-layer basis to increase the
flexibility of trading off performance vs file size. Fewer parameter bits are used at the deep layers
and more bits at the cortex layers to simultaneously optimize quantized size and model performance.
These quants were specifically optimized for the Qwen3 4B edge model to give essentially no
performance loss vs the Q8_0 quant while reducing file size by about 0.6 GB.

The layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q8_0" ],[1 ,"Q5_K_M"],[2 ,"Q5_K_M"],[3 ,"Q5_K_M"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
[6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8 ,"Q5_K_M"],[9 ,"Q5_K_M"],[10,"Q5_K_M"],[11,"Q5_K_M"],
[12,"Q6_K"  ],[13,"Q6_K"  ],[14,"Q6_K"  ],[15,"Q6_K"  ],[16,"Q6_K"  ],[17,"Q6_K"  ],
[18,"Q6_K"  ],[19,"Q6_K"  ],[20,"Q6_K"  ],[21,"Q6_K"  ],[22,"Q6_K"  ],[23,"Q6_K"  ],
[24,"Q8_0"  ],[25,"Q8_0"  ],[26,"Q8_0"  ],[27,"Q8_0"  ],[28,"Q8_0"  ],[29,"Q8_0"  ],
[30,"Q8_0"  ],[31,"Q8_0"  ],[32,"Q8_0"  ],[33,"Q8_0"  ],[34,"Q8_0"  ],[35,"Q8_0"  ]
]'
FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q6_K"
```
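
As a rough sketch of how these settings might be applied: the two FLAGS options are standard
llama-quantize flags, but the per-layer LAYER_TYPES map assumes a llama-quantize build patched to
read it (see the discussion thread linked at the end of this card). The input file name below is a
placeholder:
```
# Hypothetical invocation: assumes a llama-quantize patched to honor the
# LAYER_TYPES per-layer map; LAYER_TYPES and FLAGS are defined as above.
export LAYER_TYPES
./llama-quantize $FLAGS Qwen3-4B.BF16.gguf Qwen3-4B.Q8_0_H.gguf Q8_0
```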

These quants were selected based on combined subjective and objective performance
evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size (bytes) | PPL  | Comment                  |
|--------|--------------|------|--------------------------|
| Q8_0   | 4.3e9        | 13.2 | default embed and output |
| Q8_0_H | 3.6e9        | 13.1 | Q8_0 embed, Q6_K output  |
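
Perplexity figures like those above are typically produced with llama.cpp's llama-perplexity tool.
A minimal sketch follows; the evaluation file and context size are illustrative assumptions, not
necessarily the exact settings used for this table:
```
# Measure perplexity of the hybrid quant with llama.cpp's llama-perplexity.
# wiki.test.raw and -c 512 are placeholder settings for illustration only.
./llama-perplexity -m Qwen3-4B.Q8_0_H.gguf -f wiki.test.raw -c 512
```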

Full evals comparing Qwen3 4B Q8_0 and Q8_0_H are also available at https://huggingface.co/spaces/steampunque/benchlm

## Download the file below:

| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-4B.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf) | Q8_0_H | 3.6e9 B | Q8_0 quality |
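
Once downloaded, the file runs with stock llama.cpp tools; a minimal usage sketch (the local path
and interactive-mode invocation are assumptions, any llama.cpp frontend works):
```
# Fetch the GGUF file, then start an interactive chat with llama-cli.
wget https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf
./llama-cli -m Qwen3-4B.Q8_0_H.gguf -cnv
```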

A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040