---
license: apache-2.0
base_model: Qwen/Qwen3-4B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 8-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-4B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-4B

The hybrid quant employs different quantization levels on a per-layer basis to increase the
flexibility of trading off performance vs file size. Fewer parameter bits are used at the deep layers
and more bits at the cortex layers to simultaneously optimize quantized size and model performance.
These quants were specifically optimized for the Qwen3 4B edge model to give essentially no
performance loss vs the Q8_0 quant while reducing file size by about 0.6 GB.

The layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q8_0" ],[1 ,"Q5_K_M"],[2 ,"Q5_K_M"],[3 ,"Q5_K_M"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
[6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8 ,"Q5_K_M"],[9 ,"Q5_K_M"],[10,"Q5_K_M"],[11,"Q5_K_M"],
[12,"Q6_K"  ],[13,"Q6_K"  ],[14,"Q6_K"  ],[15,"Q6_K"  ],[16,"Q6_K"  ],[17,"Q6_K"  ],
[18,"Q6_K"  ],[19,"Q6_K"  ],[20,"Q6_K"  ],[21,"Q6_K"  ],[22,"Q6_K"  ],[23,"Q6_K"  ],
[24,"Q8_0"  ],[25,"Q8_0"  ],[26,"Q8_0"  ],[27,"Q8_0"  ],[28,"Q8_0"  ],[29,"Q8_0"  ],
[30,"Q8_0"  ],[31,"Q8_0"  ],[32,"Q8_0"  ],[33,"Q8_0"  ],[34,"Q8_0"  ],[35,"Q8_0"  ]
]'
FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q6_K"
```
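
As a rough sketch of how these settings might be applied: the two FLAGS options are standard
llama-quantize flags, but the per-layer LAYER_TYPES map assumes a llama-quantize build patched to
read it (see the discussion thread linked at the end of this card). The input file name below is a
placeholder:
```
# Hypothetical invocation: assumes a llama-quantize patched to honor the
# LAYER_TYPES per-layer map; LAYER_TYPES and FLAGS are defined as above.
export LAYER_TYPES
./llama-quantize $FLAGS Qwen3-4B.BF16.gguf Qwen3-4B.Q8_0_H.gguf Q8_0
```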

These quants were selected based on combined subjective and objective performance
evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size (bytes) | PPL  | Comment                  |
|--------|--------------|------|--------------------------|
| Q8_0   | 4.3e9        | 13.2 | default embed and output |
| Q8_0_H | 3.6e9        | 13.1 | Q8_0 embed, Q6_K output  |
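
Perplexity figures like those above are typically produced with llama.cpp's llama-perplexity tool.
A minimal sketch follows; the evaluation file and context size are illustrative assumptions, not
necessarily the exact settings used for this table:
```
# Measure perplexity of the hybrid quant with llama.cpp's llama-perplexity.
# wiki.test.raw and -c 512 are placeholder settings for illustration only.
./llama-perplexity -m Qwen3-4B.Q8_0_H.gguf -f wiki.test.raw -c 512
```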

Full evals comparing Qwen3 4B Q8_0 and Q8_0_H are also available at https://huggingface.co/spaces/steampunque/benchlm

## Download the file below:

| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-4B.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf) | Q8_0_H | 3.6e9 B | Q8_0 quality |
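
Once downloaded, the file runs with stock llama.cpp tools; a minimal usage sketch (the local path
and interactive-mode invocation are assumptions, any llama.cpp frontend works):
```
# Fetch the GGUF file, then start an interactive chat with llama-cli.
wget https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf
./llama-cli -m Qwen3-4B.Q8_0_H.gguf -cnv
```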

A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040