---
license: apache-2.0
base_model: Qwen/Qwen3-4B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 8-bit
---
|
|
|
## Llama.cpp hybrid layer quantization of Qwen3-4B by Alibaba |
|
|
|
Original model: https://huggingface.co/Qwen/Qwen3-4B |
|
|
|
The hybrid quant employs different quantization levels on a per-layer basis to increase the
flexibility of trading off performance vs file size. Fewer parameter bits are used at the deep layers
and more bits at the cortex layers to simultaneously optimize quantized size and model performance.
These quants were specifically optimized for the Qwen3 4B edge model to give essentially no performance loss
vs the Q8_0 quant while reducing file size by about 0.6 GB.
|
|
|
The layer quants are as follows: |
|
```
LAYER_TYPES='[
[0 ,"Q8_0"  ],[1 ,"Q5_K_M"],[2 ,"Q5_K_M"],[3 ,"Q5_K_M"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
[6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8 ,"Q5_K_M"],[9 ,"Q5_K_M"],[10,"Q5_K_M"],[11,"Q5_K_M"],
[12,"Q6_K"  ],[13,"Q6_K"  ],[14,"Q6_K"  ],[15,"Q6_K"  ],[16,"Q6_K"  ],[17,"Q6_K"  ],
[18,"Q6_K"  ],[19,"Q6_K"  ],[20,"Q6_K"  ],[21,"Q6_K"  ],[22,"Q6_K"  ],[23,"Q6_K"  ],
[24,"Q8_0"  ],[25,"Q8_0"  ],[26,"Q8_0"  ],[27,"Q8_0"  ],[28,"Q8_0"  ],[29,"Q8_0"  ],
[30,"Q8_0"  ],[31,"Q8_0"  ],[32,"Q8_0"  ],[33,"Q8_0"  ],[34,"Q8_0"  ],[35,"Q8_0"  ]
]'
FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q6_K"
```
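
For reference, a minimal sketch of how these variables might be applied, assuming a llama-quantize build patched to honor the per-layer LAYER_TYPES override (stock llama.cpp only exposes the global overrides used in FLAGS; see the discussion link at the end of this card). The source file name is a placeholder:

```
# Sketch only: LAYER_TYPES is honored by the patched quantize tool discussed
# in the link at the end of this card, not by stock llama.cpp. The FLAGS
# options are standard llama-quantize flags. Qwen3-4B.BF16.gguf is a
# placeholder name for the unquantized source model.
llama-quantize $FLAGS Qwen3-4B.BF16.gguf Qwen3-4B.Q8_0_H.gguf Q8_0
```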
|
|
|
These quants were selected based on combined subjective and objective performance
evaluations to give both high performance and reduced file size.
|
|
|
Comparison: |
|
|
|
Quant | Size (bytes) | PPL | Comment
---------|--------------|------|-----------
Q8_0 | 4.3e9 | 13.2 | default embed and output
Q8_0_H | 3.6e9 | 13.1 | Q8_0 embed, Q6_K output
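
Perplexity figures like these can be reproduced with llama.cpp's llama-perplexity tool; a minimal sketch, assuming a local test corpus (the card does not state which dataset produced the numbers above):

```
# wiki.test.raw is a placeholder corpus file; substitute whatever text
# you want to evaluate against.
llama-perplexity -m Qwen3-4B.Q8_0_H.gguf -f wiki.test.raw
```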
|
|
|
Full evals comparing Qwen3 4B Q8_0 and Q8_0_H are also available at https://huggingface.co/spaces/steampunque/benchlm |
|
|
|
## Download the file below:
|
| Link | Type | Size (bytes) | Notes |
|------|------|--------------|-------|
| [Qwen3-4B.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-4B-Hybrid-GGUF/resolve/main/Qwen3-4B.Q8_0_H.gguf) | Q8_0_H | 3.6e9 | Q8_0 quality |
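
As a quick usage example, the file can be fetched and run with standard tooling; a sketch, assuming huggingface-cli and a llama.cpp build are on the PATH:

```
# Fetch the quant from this repo with the standard Hugging Face CLI
huggingface-cli download steampunque/Qwen3-4B-Hybrid-GGUF Qwen3-4B.Q8_0_H.gguf --local-dir .
# Smoke-test a short generation with llama.cpp
llama-cli -m Qwen3-4B.Q8_0_H.gguf -p "Hello, my name is" -n 64
```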
|
|
|
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:
|
|
|
https://github.com/ggml-org/llama.cpp/discussions/13040 |