---
license: apache-2.0
base_model: Qwen/Qwen3-32B
base_model_relation: quantized
tags:
- Qwen
- Qwen3
- GGUF
- quantized
- 4-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-32B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-32B

The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. These quants were specifically optimized for the Qwen3-32B model to give a size similar to IQ4_XS while meeting or exceeding the performance of the IQ4_XS quant, and they use K quants in all layers for faster CPU processing on partially offloaded models.

The layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
[8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
[16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
[24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
[32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q3_K_L"],[38,"Q4_K_S"],[39,"Q3_K_L"],
[40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_S"],[45,"Q4_K_S"],[46,"Q4_K_S"],[47,"Q4_K_S"],
[48,"Q4_K_M"],[49,"Q4_K_S"],[50,"Q4_K_M"],[51,"Q4_K_S"],[52,"Q4_K_M"],[53,"Q4_K_S"],[54,"Q4_K_M"],[55,"Q4_K_S"],
[56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q4_K_M"],[62,"Q4_K_M"],[63,"Q4_K_M"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
```
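
For reference, a quantization run with these settings might look like the sketch below. The two FLAGS are standard llama-quantize options; applying LAYER_TYPES on a per-layer basis is not supported by the stock tool and assumes a patched llama-quantize along the lines of the discussion linked at the bottom of this card. File names are placeholders.

```
# Sketch only: --token-embedding-type and --output-tensor-type are stock
# llama-quantize options. Consuming LAYER_TYPES per layer assumes a patched
# build; the final argument is the base quant type the CLI requires.
LAYER_TYPES="$LAYER_TYPES" ./llama-quantize $FLAGS \
    Qwen3-32B-BF16.gguf Qwen3-32B.Q4_K_H.gguf Q4_K_M
```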

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size/e9 B | PPL | Comment                  |
|--------|-----------|-----|--------------------------|
| IQ4_XS | 17.9      | 7.8 | default embed and output |
| Q4_K_H | 17.9      | 7.8 | Q4_K embed, Q6_K output  |
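
The PPL column contains llama.cpp perplexity values. A typical way to reproduce this kind of measurement is sketched below; the wikitext-2 test file is an assumption, since the exact corpus and context settings behind the table are not specified here.

```
# Hypothetical reproduction: wiki.test.raw is an assumed wikitext-2 raw
# test file; -ngl 99 offloads all layers to GPU if one is available.
./llama-perplexity -m Qwen3-32B.Q4_K_H.gguf -f wiki.test.raw -ngl 99
```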

Full evals for the Q4_K_H quant are available at https://huggingface.co/spaces/steampunque/benchlm

## Download the file below:

| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-32B.Q4_K_H.gguf](https://huggingface.co/steampunque/Qwen3-32B-Hybrid-GGUF/resolve/main/Qwen3-32B.Q4_K_H.gguf) | Q4_K_H | 17.9 | IQ4_XS+ quality |
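
As a quick start, the file can be fetched and run with llama.cpp as sketched below; the offload count and prompt are illustrative, so adjust -ngl to the available VRAM.

```
# Fetch the GGUF from this repo (huggingface-cli ships with the
# huggingface_hub Python package).
huggingface-cli download steampunque/Qwen3-32B-Hybrid-GGUF \
    Qwen3-32B.Q4_K_H.gguf --local-dir .

# Run partially offloaded: -ngl sets how many layers go to the GPU and the
# rest run on CPU, where the all-K-quant layout helps throughput.
./llama-cli -m Qwen3-32B.Q4_K_H.gguf -ngl 40 -c 8192 \
    -p "Explain hybrid layer quantization in one paragraph."
```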

A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040