Can you provide some low-precision quantization options?

#3
by lingyezhixing - opened

For example, a Q2-level quantization around 45GB in size? My hardware is really struggling to quantize this model locally. Thank you very much

For example, a Q2-level quantization around 45GB in size?

Yeah, I'll see what I can do. The issue with the Air model is that all of the ffn_down.* tensors are not divisible by 256, so they are limited to older quantization types. The larger full-size GLM is easier to manage.
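For anyone wanting to check this themselves: block quants in the llama.cpp family require each tensor row's element count to be divisible by the quant's block size, 256 for the k-/i-quant super-blocks and 32 for legacy types like q4_0 and iq4_nl. A minimal sketch of that check (the row lengths are placeholders, not values read from the actual GGUF):

```python
# Sketch: flag tensors whose row length rules out 256-super-block quant types.
# Row lengths below are placeholders, not read from the real GLM-4.5-Air GGUF.
K_QUANT_BLOCK = 256   # super-block size of k-quants and most i-quants
LEGACY_BLOCK = 32     # block size of q4_0 / q5_0 / q8_0 / iq4_nl

example_rows = {
    "blk.1.ffn_down_exps.weight": 1408,  # not divisible by 256
    "blk.1.attn_q.weight": 4096,         # divisible by 256
}

for name, row_len in example_rows.items():
    if row_len % K_QUANT_BLOCK == 0:
        print(f"{name}: fine for k-/i-quant types (row {row_len})")
    elif row_len % LEGACY_BLOCK == 0:
        print(f"{name}: limited to 32-block types like q4_0 / iq4_nl")
    else:
        print(f"{name}: would need an unquantized (f16/bf16) fallback")
```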

I'll have something up within a couple of days probably; still finalizing the PR on the ik_llama.cpp fork, though it is looking pretty good now.

@espen96

Assuming you are the same espen96 as https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3160558205

@ubergarm could I request something in between the IQ4_KSS and IQ5_K for size, for the Air variant? Perhaps IQ4_K would be a good option?

Can probably get a decent IQ5_KSS that is just a little smaller than the IQ5_K with less jacked-up attn/shexp/first dense layer. Otherwise IQ4_K would be pretty good too. I'll fish around in that area and see what I like.
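To make the "less jacked-up attn/shexp/first dense layer" idea concrete: a mixed recipe is essentially a per-tensor mapping from name patterns to quant types. The sketch below only illustrates that shape; the type assignments are made up for the example rather than the actual published recipe, and ik_llama.cpp's llama-quantize expresses custom recipes through its own CLI options rather than Python.

```python
import re

# Illustrative mixed-quant recipe: keep attention, the shared expert, and the
# first dense layer at higher precision than the routed experts. Types here
# are example values, not the recipe actually used for these uploads.
recipe = [
    (r"blk\.0\.ffn_(gate|up|down)\.weight", "q8_0"),    # first dense layer
    (r"\.attn_.*\.weight",                  "iq6_k"),   # attention
    (r"\.ffn_.*_shexp\.weight",             "iq6_k"),   # shared expert
    (r"\.ffn_(gate|up)_exps\.weight",       "iq5_ks"),  # routed experts (gate/up)
    (r"\.ffn_down_exps\.weight",            "iq4_nl"),  # 32-block type (Air's ffn_down rows aren't 256-divisible)
]

def pick_type(tensor_name: str, default: str = "iq5_ks") -> str:
    """Return the quant type of the first pattern matching the tensor name."""
    for pattern, qtype in recipe:
        if re.search(pattern, tensor_name):
            return qtype
    return default

print(pick_type("blk.12.ffn_down_exps.weight"))  # -> iq4_nl
print(pick_type("blk.0.ffn_down.weight"))        # -> q8_0
```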

Assuming you are the same espen96 as https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3160558205

Can probably get a decent IQ5_KSS that is just a little smaller than the IQ5_K with less jacked-up attn/shexp/first dense layer. Otherwise IQ4_K would be pretty good too. I'll fish around in that area and see what I like.

That would indeed be me. Not sure why I didn't go with the handle I usually use now, but yeah...

Sounds good!

32k context is absolutely redlining all the memory I have when I use the IQ5_K quant, with general OS usage on top. I have about 500 MB left on the 3090, 600 MB on the 2060, and 1 GB of system memory to spare.
Not ideal; that's one spike away from system instability.

IQ5_KSS should help quite a bit! You know better than I do how to do this, so I leave it in your hands!
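Most of that last-gigabyte squeeze at 32k context is the KV cache sitting on top of the weights. A back-of-the-envelope sizing sketch, using placeholder dimensions rather than Air's real config (the real values live in the GGUF metadata):

```python
# Rough KV-cache sizing: 2 (K and V) * layers * KV heads * head dim * context
# * bytes per element. All model dimensions here are placeholders, NOT the
# actual GLM-4.5-Air values; read the real ones from the GGUF metadata.
n_layers   = 46      # placeholder layer count
n_kv_heads = 8       # placeholder GQA KV-head count
head_dim   = 128     # placeholder head dimension
n_ctx      = 32768   # requested context length
elem_bytes = 2       # f16 cache; smaller if the KV cache is quantized

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * elem_bytes
print(f"KV cache ~= {kv_bytes / 2**30:.1f} GiB at {n_ctx} tokens")
```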

@espen96

Okay, uploading two models for you with perplexity data published now, so you can choose your own adventure. The IQ5_KS should give you just enough breathing room to have a few tabs open in Firefox now too, hahah.

The IQ4_K will give you plenty of extra space without sacrificing too much perplexity.

The upload will complete within 15 minutes or so!

Enjoy!

There are some obscure quantization formats in ik_llama; I've tried some of them and found that iq2_bn works for ffn_down. However, I can't find any info about the iq2_bn format.

There are some obscure quantization formats in ik_llama; I've tried some of them and found that iq2_bn works for ffn_down. However, I can't find any info about the iq2_bn format.

@anikifoss

Yeah, I kinda messed around with that before realizing some quants are specifically for certain types of models. The BN is probably BitNet, so it's not good for anything that is not BitNet (ternary models). Likewise, don't use the TQ1_0 types; if you look at unsloth's smaller quants that are labeled TQ1_0, they do not actually contain any tensors with that quantization type, which is very confusing and incorrect.

I'd suggest that if you're not sure about a type, you look it up in closed PRs, either on ik or mainline, to see who created it and for what specific reason.
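A quick way to do that kind of checking on a downloaded file, e.g. to confirm whether a GGUF labeled TQ1_0 actually contains any TQ1_0 tensors, is to dump the per-tensor quantization types. A minimal sketch, assuming the gguf Python package that ships with llama.cpp (attribute names can vary between versions):

```python
# Count how many tensors of each quantization type a GGUF file contains.
# Assumes `pip install gguf` (the package maintained in the llama.cpp repo).
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder path
counts = Counter(t.tensor_type.name for t in reader.tensors)

for qtype, n in counts.most_common():
    print(f"{qtype:>10}: {n} tensors")
```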

The best small quants for general CUDA/CPU use are probably IQ1_S, IQ1_M, and the KT quants (knowing they are slower on CPU for TG).

The new MXFP4 is only for gpt-oss etc.

But yeah there are a lot of options with various levels of support for various hardware and kernels.

Yeah, with iq2_bn the model becomes incoherent, but I'm getting 46 tokens/sec on a single RTX 5090. So that should be an upper bound on how fast this model would get with several 5090s and ik_llama.cpp.

@anikifoss

I'm getting 46 tokens/sec on a single RTX 5090.

A ~2 BPW quant of GLM-4.5-Air is about ~26 GiB, so are you fully offloading in this case? I'm not sure iq2_bn has the best CUDA kernel implementation... If you're playing around experimenting with a small quant for full offload on 32 GB VRAM, definitely try out IQ2_KT; it is probably about the best perplexity for the size in that smaller range.
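The size math is easy to sanity-check. The parameter count below is an assumption (GLM-4.5-Air is commonly listed at roughly 106B total parameters), and real files come out a bit larger once higher-precision embeddings/output layers, KV cache, and CUDA buffers are counted:

```python
# Back-of-the-envelope: weights-only size of a quant at a given average BPW.
total_params = 106e9   # assumed total parameter count for GLM-4.5-Air
bpw          = 2.0     # average bits per weight of the quant

weights_gib = total_params * bpw / 8 / 2**30
print(f"~{weights_gib:.1f} GiB of weights at {bpw} BPW")  # ~24.7 GiB
```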

I know that exllamav3 EXL3 quants of GLM-4.5-Air are available now too, which are likely pretty good for full offload on newer CUDA GPUs.

Also, did you rebuild ik_llama.cpp within the past ~12 hours? GLM-4.5 and Air performance improved with the GQA fix that just got merged into main. With that patch I'm now getting about 50 tok/sec TG on two older sm86-arch CUDA RTX A6000s (non-Blackwell) with a ~4.5ish BPW test quant, fully offloaded, at 100k context in 96 GB VRAM: https://github.com/ikawrakow/ik_llama.cpp/pull/700#issuecomment-3194524662

Thanks, I'll give IQ1_KT a shot!

I'm tempted to settle for 4x RTX 5090 with a good GLM-4.5-Air quant. With vllm and --tensor-parallel-size 4, that should give another 2x token generation boost while also allowing parallel processing without losing generation speed. So that puts the whole setup at ~100 tokens/sec per request, with parallel requests on 220k shared context. This pretty much covers my local AI needs.

I was waiting for AMD, but they seem to be really behind on compute: MI50s are too weak, and their latest AI PRO R9700 lacks the extra memory to make it more attractive than the RTX 5090.

@anikifoss

Thanks, I'll give IQ1_KT a shot!

Yeah, I totally forgot that, right: the ffn_down.* tensors on GLM-4.5-Air have column sizes not divisible by 256, so it's a real PITA. Most of my quants use iq4_nl or q4_0 etc. for ffn_down_exps, as that quant was made specifically for this situation by ik long ago, back on mainline.

My understanding is that the KT quants could work with sizes divisible by 32 (I think, but I can't find the reference in ik_llama.cpp's closed issues/PRs, hah); however, they are not implemented that way currently. If a lot more models come out with odd tensor sizes, it might be worth looking into more.
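Put another way, a quant type is only an option for a tensor whose row length its block size divides, which is why the 32-block types like iq4_nl and q4_0 end up carrying ffn_down_exps on Air. A rough sketch (the block sizes are the usual llama.cpp/ik_llama.cpp ones, and the 1408 row length is believed to be Air's routed-expert ffn_down width, so treat it as illustrative):

```python
# Which quant types can a tensor with a given row length take?
# 32-element blocks: legacy quants and iq4_nl. 256-element super-blocks:
# k-quants and most i-quants (and, per the note above, the KT types as
# currently implemented).
BLOCK_SIZE = {
    "q4_0": 32, "q5_0": 32, "q8_0": 32, "iq4_nl": 32,
    "q4_K": 256, "iq4_kss": 256, "iq5_k": 256,
}

def usable_types(row_len: int) -> list[str]:
    """Quant types whose block size evenly divides the tensor's row length."""
    return [t for t, b in BLOCK_SIZE.items() if row_len % b == 0]

print(usable_types(1408))  # only the 32-block types survive
print(usable_types(1536))  # 1536 % 256 == 0, so everything is on the menu
```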

With vllm and --tensor-parallel-size 4

Double-check with folks that -tp 4 works on vllm with sm120 5090s before buying anything, as at least in the past I have a vague recollection of someone saying they could only get it to work with 2.

I was waiting for AMD

Yeah right, we all need some cheaper VRAM for sure, and I was hoping either AMD or Intel could deliver something. I saw some recent encouraging numbers from Occam (mainline llama.cpp Vulkan dev) showing potential TG improvements for AMD/Intel hardware on the Vulkan backend for q4_0/q4_1/q8_0 quantizations: https://github.com/ggml-org/llama.cpp/pull/14903#issuecomment-3194422833 but yeah, it's hard to beat CUDA performance.


Does anyone already have a RAG with every GitHub issue and PR, Hugging Face discussion, and LocalLLaMA Reddit post? Cuz I'm not able to keep up with it all anymore, even with way too many Firefox tabs open lol...

Sadly, IQ1_KT did not fit into 32 GB.

Double-check with folks that -tp 4 works on vllm with sm120 5090s before buying anything, as at least in the past I have a vague recollection of someone saying they could only get it to work with 2.

sm120 was a little rocky at launch, but it should have stabilized by now. Even in the worst case, getting 5090s to work with vllm will be a breather compared to MI50s.

Does anyone already have a RAG with every GitHub issue and PR, Hugging Face discussion, and LocalLLaMA Reddit post? Cuz I'm not able to keep up with it all anymore, even with way too many Firefox tabs open lol...

Sounds like you need a personal AI assistant with RSS ingestion :)
