Patiently Waiting for this one
Alright, this will be the last one I download, I promise! There has been a flurry of new models, which is always good, but now I'm running out of space and it feels so sad to delete a model! The Air version of this model is pretty good, so I'll just wait for this one and then decide whether I want to keep only Qwen, R1, or this. :)
I know OSS just released but tbh I hope GLM is next in line after Air.
I was going to throw out an idea if you do make another video: try Q5 or Q6 in addition to Q4, since GLM is only about 60% the size of V3/R1/the Chimeras.
But ubergarm, I really appreciate these quants. Thanks for the food, chef.
Yeah, this one will be next. I got a lot of work done with Thireus on the ik_llama.cpp PR today, so I'm feeling more confident to release knowing my GGUFs shouldn't need a re-upload haha...
I played with the 120B OSS, but more people will probably be able to run this MoE faster...
I'm back at my desk in the morning so should go fairly quickly, everything is ready to roll.
Thanks for the patience!
OSS is garbagio. Big GLM is the best thing we got going. Never thought I'd see a model worse than scout.
Okay, got the first one on the way. I've been digging the newest ik_llama.cpp quant type, IQ4_KSS, which makes a nice ffn_(gate|up) tensor right at 4.0 BPW just like iq4_kt but faster for CPU inferencing. I might make an IQ1_KT and an IQ2_KT which some folks may be able to fully offload into VRAM (mtcl probably could, until he sells a couple GPUs haha)...
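For the curious, the recipe side of it boils down to something like this with ik_llama.cpp's llama-quantize (just a sketch: the paths are placeholders, the per-tensor type list here is illustrative rather than my final recipe, and it's worth checking llama-quantize --help for the exact --custom-q syntax):
$ ./build/bin/llama-quantize \
    --imatrix /models/GLM-4.5.imatrix \
    --custom-q "ffn_(gate|up)_exps=iq4_kss,ffn_down_exps=iq5_ks" \
    /models/GLM-4.5-BF16.gguf \
    /models/GLM-4.5-IQ4_KSS.gguf \
    IQ4_KSS 32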
Uploading now:
llm_load_print_meta: model type = 355B.A32B
llm_load_print_meta: model ftype = IQ4_KSS - 4.0 bpw
llm_load_print_meta: model params = 358.338 B
llm_load_print_meta: model size = 173.726 GiB (4.164 BPW)
llm_load_print_meta: repeating layers = 172.721 GiB (4.158 BPW, 356.786 B parameters)
llm_load_print_meta: general.name = GLM 4.5
I'll follow up with perplexity graphs eventually; they take some time to run and I want to prioritize getting a few sizes out there first.
Might do a high-quality IQ5_KT ~6 BPW quant as well for those who want almost q8_0 quality with about 25% of the size shaved off, plus faster TG inferencing than full q8_0.
$ hf upload ubergarm/GLM-4.5-GGUF ./IQ4_KSS IQ4_KSS
Start hashing 4 files.
Finished hashing 4 files.
GLM-4.5-IQ4_KSS-00002-of-00004.gguf:  24%|████████                          | 11.1G/46.9G [06:23<22:45, 26.2MB/s]
GLM-4.5-IQ4_KSS-00001-of-00004.gguf:  24%|████████                          | 11.4G/46.6G [06:23<18:44, 31.3MB/s]
GLM-4.5-IQ4_KSS-00003-of-00004.gguf:  24%|████████                          | 11.4G/46.9G [06:23<19:02, 31.1MB/s]
Upload 4 LFS files:   0%|                                  | 0/4 [00:00<?, ?it/s]
GLM-4.5-IQ4_KSS-00004-of-00004.gguf:  24%|████████                          | 11.0G/46.1G [06:23<18:36, 31.5MB/s]
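Once that finishes, grabbing just the one quant (instead of the whole repo) should be something like this, assuming the folder layout matches the upload command above:
$ hf download ubergarm/GLM-4.5-GGUF \
    --include "IQ4_KSS/*" \
    --local-dir ./GLM-4.5-GGUF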
Got a smaller one I'm testing right now too. Perfect size for a 2x64GB DDR5@6000MT/s AM5 9950X rig; compile in the Zen5 avx_vnni PR and you get some boost to PP as well (example launch command after the metadata below)!
llm_load_print_meta: model type = 355B.A32B
llm_load_print_meta: model ftype = IQ2_KL - 2.6875 bpw
llm_load_print_meta: model params = 358.338 B
llm_load_print_meta: model size = 127.746 GiB (3.062 BPW)
llm_load_print_meta: repeating layers = 126.741 GiB (3.051 BPW, 356.786 B parameters)
llm_load_print_meta: general.name = GLM 4.5
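Roughly how I'd launch that on such a rig, as a sketch (threads and the model path are placeholders; -fmoe is ik_llama.cpp-specific, and if you also have a GPU you can add -ngl 99 -ot exps=CPU to keep attention and shared experts on it):
# point --model at the first IQ2_KL split file
$ ./build/bin/llama-server \
    --model /models/GLM-4.5-IQ2_KL-00001-of-0000N.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    --threads 16 \
    --host 127.0.0.1 --port 8080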
Hell yeah, this looks promising! I have a 6x3090 (144GB VRAM) and this model would fit perfectly! Question: do you think it's better to run GLM 4.5 full at low quant (like IQ2_KL) or GLM 4.5 air at high quant (like Q8_K_XL)? They're roughly the same size. What do you think?
@Hiasma
I have a 6x3090 (144GB VRAM) and this model would fit perfectly!
Great! Keep in mind that the last layer of both the larger GLM-4.5 and the smaller Air, while it takes up a few GiB on disk, doesn't get loaded into RAM/VRAM, so the actual memory requirement is a little smaller than the file size shown here (thanks to a new tensor load flag called TENSOR_SKIP). This lets the NextN MTP (Multi-Token Prediction) tensors stay in the quants, so if that feature is ever implemented you won't have to download a new quant!
do you think it's better to run GLM 4.5 full at low quant (like IQ2_KL) or GLM 4.5 air at high quant (like Q8_K_XL)? They're roughly the same size. What do you think??
That is a really good question! I hope to release a GLM-4.5-Air quant around 6 BPW that should be within a couple percent of the Q8_0 in terms of perplexity, so it would have faster TG inference due to the smaller tensor size (TG is memory-bandwidth limited for these quants; if I make a small IQ1_KT it would be CPU limited for TG).
If my application needed more "knowledge" baked in, I'd probably reach for the larger model despite the heavier quantization. If I had RAG or other context-heavy work, I'd probably go for a bigger quant of Air. But honestly I don't have enough experience with both at the various quantizations to have sufficiently vibe checked it haha...
fwiw, the perplexity of Air-Q8_0 on wiki.test.raw is 4.5798 +/- 0.02804, and the perplexity of full-size GLM-4.5-Q8_0 is 3.1746 +/- 0.01784. The IQ2_KL I'm still processing but will upload a graph soon! Can't really compare perplexity across models, but those are some data points.
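(Those numbers come from the usual llama-perplexity run over the wikitext-2 wiki.test.raw file, roughly like this; the model path is a placeholder for whichever quant is being measured:)
$ ./build/bin/llama-perplexity \
    --model /models/GLM-4.5-Q8_0.gguf \
    -f wiki.test.raw \
    -fa \
    --threads 16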
Sounds good, looking forward to it! Thanks for the detailed answer. I will probably just end up trying both and deciding for myself, but it would definitely be nice to see the perplexity scores for an objective measurement I can cross-reference as well. My main use case will be agentic coding (Cline in VS Code), which leads me to believe full GLM 4.5 would be the way to go, but it could be splitting hairs.
Thank you so much for getting this over the line! Very nice fix for the issue with chat template matching!
For now I only have 128GB of DDR4 RAM and 16GB of VRAM, so I have to copy your imatrix again (thanks!) and tweak the recipe slightly to make a slightly smaller quant.
do you think it's better to run GLM 4.5 full at low quant (like IQ2_KL) or GLM 4.5 air at high quant (like Q8_K_XL)?
I can share a bit on this. I have a code refactoring eval with deterministic output. It is not perfectly deterministic due to batching, so I also rely on logprobs to make sure the model doesn't just get it right based on luck.
The outcome of the eval is simple -- if a model can't pass this test perfectly, then I can't trust it for coding.
I did the test using the official z.ai endpoint about a week ago. GLM-4.5-Air made several mistakes, while GLM-4.5 passed the test. Both were tested with thinking disabled, as this is mostly a pattern matching eval, and overthinking harms the result.
Then I did some local testing with an IQ2_K quant of GLM-4.5 (IQ2_K for the MoE layers, IQ4_K for the rest). Without an imatrix, it made a couple of mistakes. With the imatrix, it passed the test.
Now I am testing an IQ2_KL quant of GLM-4.5 (IQ2_KL for the MoE layers, IQ4_KSS for the rest). Even without an imatrix, it passed the test.
So I will definitely choose GLM-4.5 IQ2_KL over GLM-4.5-Air Q8. The caveat is that other than the eval, I don't use the models much for actual coding yet. Hopefully that changes soon, now that we have a strong local model that was trained for agentic tool calls.
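For anyone who wants to poke at logprobs the same way against a local llama-server, the native /completion endpoint can return per-token candidates via n_probs; a rough sketch (the prompt is a placeholder and the response field names may differ between server versions):
$ curl -s http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "<refactoring task here>", "n_predict": 128, "temperature": 0, "n_probs": 5}' \
    | jq '.completion_probabilities'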
How big is that last layer? I'm halfway done with Q3_K_XL; IQ4_KSS looks like an upgrade depending on perplexity. I'll probably have to use mainline to print all the tensors and file sizes, then get to cramming tensors.
I released a few more quants (with recipes available now) for both this one and Air, and added a graph of the perplexities I have so far, including baselines, in case you want to compare against your recipes.
How big is that last layer?
You can use ik_llama.cpp to print all the tensors too, works fine. You just need a python venv with numpy==1.26.4 and a few more packages, then run:
$ python ./gguf-py/scripts/gguf_dump.py /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-Q8_0.gguf | less
or similar.
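If you're starting from scratch, the venv setup is just something like this (run from the ik_llama.cpp checkout; add whatever else the script complains about):
$ python -m venv venv
$ source venv/bin/activate
$ pip install numpy==1.26.4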
Here is the final layer for GLM-4.5 big boi:
1737: 775946240 | 5120, 151552, 1, 1 | Q8_0 | blk.92.nextn.shared_head_head.weight
1738: 5120 | 5120, 1, 1, 1 | F32 | output_norm.weight
1739: 52428800 | 10240, 5120, 1, 1 | Q8_0 | blk.92.nextn.eh_proj.weight
1740: 5120 | 5120, 1, 1, 1 | F32 | blk.92.nextn.enorm.weight
1741: 5120 | 5120, 1, 1, 1 | F32 | blk.92.nextn.hnorm.weight
1742: 5120 | 5120, 1, 1, 1 | F32 | blk.92.attn_norm.weight
1743: 1258291200 | 1536, 5120, 160, 1 | Q8_0 | blk.92.ffn_down_exps.weight
1744: 1258291200 | 5120, 1536, 160, 1 | Q8_0 | blk.92.ffn_gate_exps.weight
1745: 1258291200 | 5120, 1536, 160, 1 | Q8_0 | blk.92.ffn_up_exps.weight
1746: 160 | 160, 1, 1, 1 | F32 | blk.92.exp_probs_b.bias
1747: 819200 | 5120, 160, 1, 1 | F32 | blk.92.ffn_gate_inp.weight
1748: 7864320 | 1536, 5120, 1, 1 | Q8_0 | blk.92.ffn_down_shexp.weight
1749: 7864320 | 5120, 1536, 1, 1 | Q8_0 | blk.92.ffn_gate_shexp.weight
1750: 7864320 | 5120, 1536, 1, 1 | Q8_0 | blk.92.ffn_up_shexp.weight
1751: 5120 | 5120, 1, 1, 1 | F32 | blk.92.post_attention_norm.weight
1752: 128 | 128, 1, 1, 1 | F32 | blk.92.attn_k_norm.weight
1753: 1024 | 1024, 1, 1, 1 | F32 | blk.92.attn_k.bias
1754: 5242880 | 5120, 1024, 1, 1 | Q8_0 | blk.92.attn_k.weight
1755: 62914560 | 12288, 5120, 1, 1 | Q8_0 | blk.92.attn_output.weight
1756: 128 | 128, 1, 1, 1 | F32 | blk.92.attn_q_norm.weight
1757: 12288 | 12288, 1, 1, 1 | F32 | blk.92.attn_q.bias
1758: 62914560 | 5120, 12288, 1, 1 | Q8_0 | blk.92.attn_q.weight
1759: 1024 | 1024, 1, 1, 1 | F32 | blk.92.attn_v.bias
1760: 5242880 | 5120, 1024, 1, 1 | Q8_0 | blk.92.attn_v.weight
1761: 5120 | 5120, 1, 1, 1 | F32 | blk.92.nextn.shared_head_norm.weight
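To put a rough number on the size question: summing the Q8_0 element counts in that dump and multiplying by Q8_0's 34 bytes per 32 weights gives about 4.7 GiB for this layer (the F32 norms only add a few MB):
$ python3 -c 'q8 = 775946240 + 52428800 + 3*1258291200 + 3*7864320 + 2*62914560 + 2*5242880; print(q8, round(q8*34/32 / 2**30, 2), "GiB")'
4763156480 4.71 GiB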
Keep in mind this entire last layer is now marked with the TENSOR_SKIP flag, so it is not loaded at startup (as the LOG_WARNING messages show you). There is also no imatrix data for this layer because the compute graph skips it entirely, which seems to be the correct implementation: in my empirical testing, an earlier experimental version that did include the last layer actually gave "worse" perplexity.
So for now it only takes up disk space and is neither loaded nor used at all, which is a new feature. The idea is that if MTP (multi-token prediction) gets implemented in the future, folks won't have to re-download quants.