Fingers crossed for the 4.6-Air
4.6 is a bit too big for my setup, so I sincerely hope they drop a 4.6-Air variant.
Sorry, no love for the Poors from ZAI
Maybe Unsloth will do an IQ2_XXS quant; it worked well for 4.5.
You might try the ik_llama.cpp quants from ubergarm and such when they come out, though they only run with the ik_llama.cpp server due to its custom quantization types. GLM 4.5 works well in 128GB of RAM (plus a single GPU) with them.
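For reference, launching one of those quants looks something like this. This is a rough sketch based on the kind of commands ubergarm posts on the model cards; the filename is a placeholder and the flags are assumptions, so check the card for your exact model:

```
# Build ik_llama.cpp first (https://github.com/ikawrakow/ik_llama.cpp), then:
./build/bin/llama-server \
    -m GLM-4.5-IQ2_KT.gguf \        # hypothetical ubergarm quant filename
    -c 32768 \                      # context size
    -ngl 99 \                       # put all layers on the GPU...
    -ot exps=CPU \                  # ...but keep MoE expert tensors in system RAM
    -fmoe \                         # ik_llama.cpp's fused-MoE path
    --threads 16 \
    --host 127.0.0.1 --port 8080
```

The `-ngl 99` plus `-ot exps=CPU` combo is what makes the 128GB RAM + single GPU setup work: the small dense/attention tensors sit in VRAM while the huge expert tensors stay in RAM.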
...Would the Unsloth guys be interested in this? Maybe combine it with whatever dynamic script y'all use? The KL and KT trellis quant types really do help a ton in that ~3bpw range.
What quant do you run with ik_llama.cpp on the 128GB machine? I have a PC with that configuration with an AMD GPU. I also have an M4 Max with 128GB of memory. GLM-4.5 was always very slow for some reason, much slower than other similar-sized models. I'd be curious to know what kind of speeds you get with those custom quants.
We fixed multiple issues with the chat template; the 2-bit is out now, and the rest will come in the next few hours!
Them not releasing an Air version serves everyone right for complaining about GLM 4.5 while actually using the Air version and not specifying it.
I am using IQ2_XXS on a machine with a 5950X, 128GB RAM, and a 7600 XT with 16GB VRAM. It works and the results are good, but performance is low; an Air version would be much better. I'm offloading as many layers as possible to the GPU using llama.cpp (rough command sketch after the logs below).
PS: I've disabled thinking by appending '/nothink' to the end of the prompt, and that somehow made GLM 4.6 usable. Example:
```
llama_perf_sampler_print:    sampling time =      55.64 ms /  9635 runs   (    0.01 ms per token, 173176.12 tokens per second)
llama_perf_context_print:        load time =   13348.83 ms
llama_perf_context_print: prompt eval time =  454403.14 ms /  9173 tokens (   49.54 ms per token,    20.19 tokens per second)
llama_perf_context_print:        eval time =  236910.60 ms /   461 runs   (  513.91 ms per token,     1.95 tokens per second)
llama_perf_context_print:       total time =  691481.59 ms /  9634 tokens
llama_perf_context_print:   graphs reused =        458
llama_memory_breakdown_print: | memory breakdown [MiB]         | total    free    self    model   context   compute   unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Radeon™ RX 7600 XT) | 16368 =    146 + (15519 =   5884 +    8712 +      923) +      701 |
llama_memory_breakdown_print: |   - Host                       |                  112062 = 108553 +    3432 +       76             |
```
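For anyone wanting to reproduce this setup, the invocation looks roughly like the following. It's a sketch only: the GGUF filename is hypothetical and the `-ot` offload pattern is an assumption you'd tune to your own VRAM budget:

```
# Mainline llama.cpp, ROCm build. Filename and -ot pattern are assumptions.
./llama-cli \
    -m GLM-4.6-IQ2_XXS.gguf \             # hypothetical local quant filename
    -ngl 99 \                             # offload every layer that fits...
    -ot "\.ffn_.*_exps\.=CPU" \           # ...while pinning MoE experts to RAM
    -c 8192 \
    -p "Refactor this function /nothink"  # trailing /nothink disables thinking mode
```

With 16GB of VRAM and a ~100GB+ model, most of the weights stay in system RAM either way, which is consistent with the ~2 t/s eval speed in the logs above.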
This distill from 4.6 to Air is working:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill
I've tried BasedBase/GLM-4.5-Air-GLM-4.6-Distill and found its quality isn't good enough. It also hung. The IQ2_XXS of GLM-4.6 is better.
Now people are going to talk about this distill as if it's the real thing. Sad!