Appreciation.

#1
by atopwhether - opened

Thank you so much for a multitude of reasons! Firstly, I've always enjoyed reading your posts in LocalLLaMA; you are one of the very few users I tend to notice in the sea of posts. I always learn something new, and I've never seen you be anything other than extremely kind and genuinely helpful. That's awesome.

So many times I've sat for hours reading discussions on the ik_llama.cpp GitHub, trying to understand concepts brilliant people speak about so effortlessly, only to, admittedly, get discouraged and conclude I'd give the project a go another time. I always kept tabs on the discussions over there, building my understanding with each new one. It's special seeing you, ikawrakow, and bartowski collaborating, sharing data and ideas my brain can't begin to comprehend. It's neat. It's one of those things about open source that often flies under the radar, and I wanted to share my appreciation.

Perfect segue: thank you so much for writing your quick start guide for ik_llama. I'd read it a few times over the last few weeks without diving in. That was until I read your Qwen3-30B-A3B Reddit post today, saw you were even kind enough to share your quants for it, said "hell yeah," and dove in. I am extremely glad that I did!

I have a bit of a frankenlambuild™ for running local models: a 3080 Ti and 2080 Ti, a 980 Pro NVMe, and two completely different sets of DDR4 (2x16GB of 3200 Vengeance and 2x8GB of the cheapest C18 3600 TeamForce), all tied together with an i9-12900k. Once I got into AI, I just kinda tossed parts from other builds into this to try and run bigger models. It's not winning any beauty or efficiency contests, but it works pretty decently after many hours of BIOS tweaking.

With the frankenbuild context known, I went from running Q4_K_M Qwen3-30B-A3B at:

llama.cpp:
prompt eval time =      22.24 ms /     1 tokens (   22.24 ms per token,    44.96 tokens per second)
eval time        =    2628.01 ms /   129 tokens (   20.37 ms per token,    49.09 tokens per second)

to ik_llama.cpp (using your quant):

prompt eval time     =      71.10 ms /     9 tokens (    7.90 ms per token,   126.59 tokens per second) | 
generation eval time =    2201.46 ms /   174 runs   (   12.65 ms per token,    79.04 tokens per second) | 

I was shocked by the insane speed improvement, absolutely amazing! I'm going to give the Qwen3-235B-A22B a whirl. I think I might be a bit underpowered to run it at a usable level, but I'm super curious.

I'm still working on understanding how --override-tensor works at a more fundamental level. With a multi-GPU setup (one 12GB card, one 11GB), the regex gets confusing quickly. I'm excited to get that dialed in, absorb more of your knowledge to create a proper quant for Llama Scout, and simply learn more about this side of the LLM world.
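In case it helps anyone else wrestling with the same regex, here's the rough shape of what I've been experimenting with. Treat it as a sketch rather than a recipe: the model filename, context size, and layer ranges are placeholders I still need to tune for my 12GB + 11GB cards.

```bash
# Sketch of a two-GPU + CPU split via --override-tensor (-ot).
# Rules are matched against tensor names; the catch-all CPU rule goes last
# so the GPU rules get first crack at the expert tensors they name.
# Layer ranges and the model path are placeholders, not tuned values.
./build/bin/llama-server \
  -m Qwen3-30B-A3B-your-quant-here.gguf \
  -c 16384 \
  -ngl 99 \
  -ot "blk\.[0-9]\.ffn_.*_exps=CUDA0" \
  -ot "blk\.1[0-9]\.ffn_.*_exps=CUDA1" \
  -ot "ffn_.*_exps=CPU"
```

The idea being: expert FFN tensors for layers 0-9 land on the 3080 Ti (CUDA0), layers 10-19 on the 2080 Ti (CUDA1), and everything else that matches the final pattern stays in system RAM.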

Thanks again! I appreciate the quants, the guide, the knowledge – you're awesome!

@atopwhether

Wow! Such an amazing thoughtful post! hugs

I'm really happy to hear you're getting better numbers with this quant on your amusing frankenrig! And yeah, a fast NVMe makes loading these things much quicker.

And totes, the --override-tensor flag can get pretty confusing/ugly when you're passing a bunch of escaped regex on the command line... other folks and I are still figuring out useful common patterns for various models.
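For example, one pattern that keeps coming up for MoE models (again, just a sketch with a placeholder model path, not a tuned recipe): fully offload with -ngl, then push the big routed-expert tensors back onto CPU so attention and the shared weights stay on GPU.

```bash
# Common MoE starting point (sketch): offload everything, then override the
# routed-expert FFN tensors (ffn_*_exps) back to CPU/system RAM.
./build/bin/llama-server \
  -m your-moe-model.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU"
```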

I never made a Scout/Maverick quant, but I believe ik may have a discussion about them; since they're MoE, they should run pretty well on ik's fork.

And yeah, it's been great to find a community of folks all sharing information and testing out ideas to improve performance and quality, and passing that on to the wider community. Thanks for doing your part, and I appreciate the encouragement! <3
