Fingers crossed for the 4.6-Air
4.6 is a bit too big for my setup, so I sincerely hope they drop a 4.6-Air variant.
Sorry, no love for the Poors from ZAI
Maybe Unsloth will do an IQ2_XXS quant; it worked well for 4.5.
You might try the ik_llama.cpp quants from ubergarm and such when they come out, though they only run with the ik_llama.cpp server due to its custom quantization types. GLM 4.5 works well in 128GB of RAM (plus a single GPU) with them.
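For reference, launching one of those quants looks something like this. This is a rough sketch based on the kind of commands ubergarm posts on the model cards; the filename is a placeholder and the flags are assumptions, so check the card for your exact model:

```
# Build ik_llama.cpp first (https://github.com/ikawrakow/ik_llama.cpp), then:
./build/bin/llama-server \
    -m GLM-4.5-IQ2_KT.gguf \        # hypothetical ubergarm quant filename
    -c 32768 \                      # context size
    -ngl 99 \                       # put all layers on the GPU...
    -ot exps=CPU \                  # ...but keep MoE expert tensors in system RAM
    -fmoe \                         # ik_llama.cpp's fused-MoE path
    --threads 16 \
    --host 127.0.0.1 --port 8080
```

The `-ngl 99` plus `-ot exps=CPU` combo is what makes the 128GB RAM + single GPU setup work: the small dense/attention tensors sit in VRAM while the huge expert tensors stay in RAM.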
...Would the Unsloth guys be interested in this? Maybe combine it with whatever dynamic script y'all use? The KL and KT trellis quant types really do help a ton in that ~3bpw range.
What quant do you run with ik_llama.cpp on the 128GB machine? I have a PC with that configuration with an AMD GPU. I also have an M4 Max with 128GB of memory. GLM-4.5 was always very slow for some reason, much slower than other similar-sized models. I'd be curious to know what kind of speeds you get with those custom quants.
We fixed multiple issues with the chat template; the 2-bit is out now, and the rest will come in the next few hours!
Them not releasing an Air version serves everyone right for complaining about GLM 4.5 while actually using the Air version and not specifying it.
I am using IQ2_XXS on a machine with a 5950X, 128GB RAM, and a 7600 XT with 16GB VRAM. It works and the results are good, but performance is low; an Air version would be much better. I'm offloading as many layers as possible to the GPU using llama.cpp (rough command sketch after the logs below).
PS: I've disabled thinking by appending '/nothink' to the end of the prompt, and that somehow made GLM 4.6 usable. Example:
```
llama_perf_sampler_print:    sampling time =      55.64 ms /  9635 runs   (    0.01 ms per token, 173176.12 tokens per second)
llama_perf_context_print:        load time =   13348.83 ms
llama_perf_context_print: prompt eval time =  454403.14 ms /  9173 tokens (   49.54 ms per token,    20.19 tokens per second)
llama_perf_context_print:        eval time =  236910.60 ms /   461 runs   (  513.91 ms per token,     1.95 tokens per second)
llama_perf_context_print:       total time =  691481.59 ms /  9634 tokens
llama_perf_context_print:   graphs reused =        458
llama_memory_breakdown_print: | memory breakdown [MiB]         | total    free    self    model   context   compute   unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Radeon™ RX 7600 XT) | 16368 =    146 + (15519 =   5884 +    8712 +      923) +      701 |
llama_memory_breakdown_print: |   - Host                       |                  112062 = 108553 +    3432 +       76             |
```
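For anyone wanting to reproduce this setup, the invocation looks roughly like the following. It's a sketch only: the GGUF filename is hypothetical and the `-ot` offload pattern is an assumption you'd tune to your own VRAM budget:

```
# Mainline llama.cpp, ROCm build. Filename and -ot pattern are assumptions.
./llama-cli \
    -m GLM-4.6-IQ2_XXS.gguf \             # hypothetical local quant filename
    -ngl 99 \                             # offload every layer that fits...
    -ot "\.ffn_.*_exps\.=CPU" \           # ...while pinning MoE experts to RAM
    -c 8192 \
    -p "Refactor this function /nothink"  # trailing /nothink disables thinking mode
```

With 16GB of VRAM and a ~100GB+ model, most of the weights stay in system RAM either way, which is consistent with the ~2 t/s eval speed in the logs above.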
This distill from 4.6 to Air is working:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill
I've tried BasedBase/GLM-4.5-Air-GLM-4.6-Distill and found its quality isn't good enough. It also hung. The IQ2_XXS of GLM-4.6 is better.
Now people are going to talk about this distill as if it's the real thing. Sad!