---
license: apache-2.0
base_model: Qwen/Qwen3-4B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 8-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-4B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-4B

The hybrid quant employs different quantization levels on a per-layer basis to increase the
flexibility of trading off performance vs file size. Fewer parameter bits are used at deep layers
and more bits at cortex layers to simultaneously optimize quantized size and model performance.
These quants were specifically optimized for the Qwen3 4B edge model for essentially no performance loss
vs the Q8_0 quant while reducing file size by about 0.6G.
22
+
23
+ The layer quants are as follows:
24
+ ```
25
+ LAYER_TYPES='[
26
+ [0 ,"Q8_0" ],[1 ,"Q5_K_M"],[2 ,"Q5_K_M"],[3 ,"Q5_K_M"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
27
+ [6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8, "Q5_K_M"],[9, "Q5_K_M"],[10,"Q5_K_M"],[11,"Q5_K_M"],
28
+ [12,"Q6_K" ],[13,"Q6_K" ],[14,"Q6_K" ],[15,"Q6_K" ],[16,"Q6_K" ],[17,"Q6_K" ],
29
+ [18,"Q6_K" ],[19,"Q6_K" ],[20,"Q6_K" ],[21,"Q6_K" ],[22,"Q6_K" ],[23,"Q6_K" ],
30
+ [24,"Q8_0" ],[25,"Q8_0" ],[26,"Q8_0" ],[27,"Q8_0" ],[28,"Q8_0" ],[29,"Q8_0" ],
31
+ [30,"Q8_0" ],[31,"Q8_0" ],[32,"Q8_0" ],[33,"Q8_0" ],[34,"Q8_0" ],[35,"Q8_0" ]
32
+ ]'
33
+ FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q6_K"
34
+ ```
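
A rough sketch of how this spec might be applied: the two FLAGS options are standard llama-quantize flags, but the LAYER_TYPES per-layer spec is assumed to be consumed by the patched quantize tool from the discussion linked at the end of this card, not by mainline llama-quantize. File names are illustrative.
```bash
# Hypothetical invocation, assuming a llama-quantize build patched to read
# the LAYER_TYPES spec set above. --token-embedding-type and
# --output-tensor-type are standard llama-quantize flags; the trailing Q8_0
# is the usual base quantization type argument.
export LAYER_TYPES
./llama-quantize $FLAGS Qwen3-4B.BF16.gguf Qwen3-4B.Q8_0_H.gguf Q8_0
```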

These quants were selected based on combined subjective and objective performance
evaluations to give both high performance and reduced file size.

Comparison:

Quant | Size/e9 B | PPL | Comment
---------|-----------|------|-----------
Q8_0 | 4.3 | 13.2 | default embed and output
Q8_0_H | 3.6 | 13.1 | Q8_0 embed, Q6_K output

Full evals comparing Qwen3 4B Q8_0 and Q8_0_H are also available at https://huggingface.co/spaces/steampunque/benchlm

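PPL figures like those in the table above can be generated with llama.cpp's perplexity tool (a sketch; the card does not state which corpus or context settings produced the table, so the test file here is a placeholder):

```bash
# Sketch: compare perplexity of the two quants with llama-perplexity.
# wiki.test.raw is a placeholder corpus; absolute PPL values depend on the
# text and settings used for the evaluation.
./llama-perplexity -m Qwen3-4B.Q8_0.gguf   -f wiki.test.raw
./llama-perplexity -m Qwen3-4B.Q8_0_H.gguf -f wiki.test.raw
```
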
## Download the file below:
| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-4B.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf) | Q8_0_H | 3.6 | Q8_0 quality |

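One way to fetch and try the file from the command line (a sketch; a plain curl/wget of the resolve URL above works equally well):

```bash
# Sketch: download with the Hugging Face CLI (pip install huggingface_hub)
huggingface-cli download steampunque/Qwen3-4B-Hybrid-GGUF \
  Qwen3-4B.Q8_0_H.gguf --local-dir .

# Quick smoke test with a recent llama.cpp build
./llama-cli -m Qwen3-4B.Q8_0_H.gguf -p "Hello" -n 64
```
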
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040