Worse than expected performance

#1
by martijk1 - opened

I've tried the Q5_K_L version of this model, as well as the Q4_K_L version of the 32B model, and for some reason, both made many spelling and grammar mistakes in my native language. I've tried the unsloth Q5_K_M and Q4_K_M quants respectively, and from my limited testing, those don't seem to suffer from the same problem. I've used the same parameters for both. I've had good experiences with your quants before, so I'm surprised. Could there be an issue with these quants?

Can you share any exact prompts and/or tools used?

> I've tried the Q5_K_L version of this model, as well as the Q4_K_L version of the 32B model, and for some reason, both made many spelling and grammar mistakes in my native language. I've tried the unsloth Q5_K_M and Q4_K_M quants respectively, and from my limited testing, those don't seem to suffer from the same problem. I've used the same parameters for both. I've had good experiences with your quants before, so I'm surprised. Could there be an issue with these quants?

Yes .. for multilingual use I suggest Gemma 3 27B

> Can you share any exact prompts and/or tools used?

Prompt:

Zijn whisky stones net zo effectief als water om drinken af te koelen?

Meaning: are whisky stones as effective as water at chilling drinks?

I used Ollama through Open WebUI with a context length and num_predict of 32K, and the applicable parameters at the officially recommended values for the default thinking mode. All other parameters were left at their defaults. I remember that it used “verduinen” (a nonsense word) instead of “verdunnen” (to dilute), and it even doubled down twice when I asked about it. It also used “opkoelen” (a nonsense word, literally “cooling up”) instead of “afkoelen” (to cool down) or “opwarmen” (to heat up). The MoE answer also contained a lot of nonsense in general.
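For anyone who wants to reproduce this, here's a minimal sketch of roughly that setup via the Ollama Python client rather than Open WebUI. The model tag is a placeholder for whichever quant you pulled, and the sampling values are my reading of the officially recommended thinking-mode settings, so treat both as assumptions:

```python
# Rough sketch of the setup described above (not my exact Open WebUI config).
# Assumes the Ollama Python client (pip install ollama) and an already-pulled
# quant; the model tag below is a placeholder.
import ollama

response = ollama.chat(
    model="qwen3:30b-a3b",  # placeholder tag; point it at the quant under test
    messages=[{
        "role": "user",
        "content": "Zijn whisky stones net zo effectief als water om drinken af te koelen?",
    }],
    options={
        "num_ctx": 32768,      # context length
        "num_predict": 32768,  # max tokens to generate
        "temperature": 0.6,    # recommended thinking-mode sampling values
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
    },
)
print(response["message"]["content"])
```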

I used the same prompt to test the unsloth quants and got sensible responses, but I didn’t repeat the test to confirm that those don’t suffer from the same issue. I’ll run some more tests tomorrow (CEST) and report back. It’s just that I’m not used to these kinds of mistakes from similarly sized models at 4- or 5-bit quantization, so it feels off.

I don't normally write posts or comments, but I felt like I should chime in. This isn't directed towards martijk1 in any way. (As mentioned, Gemma is much better in general for multilingual tasks. I believe Mistral also does well, depending on your native language, if I recall correctly. Aya-expanse too!)

It's this odd larger trend I've noticed. For example, on /r/LocalLLaMA, everyone seems to be waiting for Unsloth to fix everyone else's 'broken' quantizations with every major release. (I'd post this over there, but honestly, I'm too lazy to make an account. I'm a professional lurker.) I feel like ole Bartowski unfairly gets lumped into this idea that his quants are somehow in a broken state until Unsloth 'fixes' them. Seeing that perception take hold, where someone might constantly have to push back against the notion that their freely contributed work is 'broken'... it's hard to imagine that scenario not being genuinely discouraging, or just overall crappy.

In reality, both are great, driven by people who genuinely love what they do. This guy's a wizard. He clearly puts a lot of effort, time, and passion into perfecting his craft, all while contributing to llama.cpp and juggling a professional life at Arcee. (Just noticed this; of course you work at the org that makes my favorite finetunes.) How does anyone manage to juggle all of this at once? I have no idea. I'm going with the theory that TheBloke transformed into an AI and named himself Bartowski. Seriously though, I just wanted to say I appreciate all that you do, and thank you for helping the community while asking nothing in return. Keep being awesome.

Thanks for coming to my TED talk.

That aside, I've tested both Bartowski's and Unsloth's Q4_K_M, Q4_K_L/XL, Q5_K_L, and Q6_K quants, and they performed virtually the same in my little gauntlet – accounting for the normal, inevitable variations and noise that I'm far too dumb to test properly, let alone fully understand. I mostly vibe-tested the 1.7B, 4B, 8B, 14B, 30B-A3B, and 32B models. (The 0.6B model is somewhere punching the air right now.) They all scored the same on all questions and specific format requests. If it weren't for the filename, I would have had no idea I hadn't just loaded the same quant file again.

tl;dr: Both make awesome quants and seem like awesome people, possibly sentient AI.

Out of my own curiosity, I ran some tests using @martijk1's prompt on a few models (Bartowski Q4_K_M/L, Unsloth Q4_K_M/XL). I didn't see any of the gibberish words you mentioned, though I could only check for them because you wrote them out! Dutch is a beautiful-looking language; even the gibberish looks nice. Needless to say, I have no clue what's being said. Here's some equally gibberish-level data; maybe some of it is useful. If links aren't allowed here, rip: https://privatebin.net/?f31ac877775cd12b#F3EUyfeALctkSPogguReLYHqVM7vSHByrhquY9tbgKdK

Edit: whoops, this is probably useful to know. For reference, I tested on a local build with the latest commit being e98b3692be4cd8fbbd9a56fbacc2f2bf0bf26a68.
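If anyone wants to repeat the comparison, this is roughly what I did, sketched in Python. It assumes a llama.cpp build with llama-cli available (my inference from the commit hash above), and the GGUF filenames are placeholders for the quants being compared:

```python
# Hypothetical sketch of the side-by-side check: run the same Dutch prompt
# against two GGUF quants with llama.cpp's llama-cli and eyeball the outputs.
# Filenames are placeholders; sampling flags mirror the thinking-mode values.
import subprocess

PROMPT = "Zijn whisky stones net zo effectief als water om drinken af te koelen?"

QUANTS = [
    "Qwen3-30B-A3B-Q4_K_M.gguf",   # placeholder: Bartowski quant
    "Qwen3-30B-A3B-Q4_K_XL.gguf",  # placeholder: Unsloth quant
]

for gguf in QUANTS:
    result = subprocess.run(
        ["./llama-cli", "-m", gguf, "-p", PROMPT,
         "-c", "32768", "-n", "4096",
         "--temp", "0.6", "--top-p", "0.95",
         "--top-k", "20", "--min-p", "0.0"],
        capture_output=True, text=True,
    )
    print(f"=== {gguf} ===\n{result.stdout}\n")
```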

@atopwhether thanks for your very nice comment. I’m equally impressed with Bartowski’s work, and I wouldn’t even have started this discussion if that weren’t the case. I don’t remember which model it was, but I’ve been in the opposite situation, where the unsloth quants somehow seemed to be broken for me. I’ve actually standardized on Bartowski’s quants because I trust his work and because virtually every model you can think of is somehow part of his repertoire.

About Gemma: I fully agree. I use it for all my language-related prompts and for many general questions. Everything that’s not language-related I usually ask in English anyway, no matter which model, but I used the aforementioned prompt as a quick test to see how Qwen 3 responds.

I’ll try again in a few hours and see if it’s reproducible. If not, I must have messed something up with a parameter. I usually attribute flawed logic and hallucinations to the model itself (or to the slight quality loss from quantization), but these made-up words really stood out, as language itself usually works fine.

@atopwhether your Dutch responses are actually also riddled with gibberish. Maybe this model is just very bad at Dutch, so I’ll try and see if I can get the unsloth quants to make the same types of mistakes.

Edit: a quick test with unsloth was flawless again in terms of language, but I don’t have enough time right now to do a more in-depth test.

I've done some extra testing with different Dutch prompts, and my conclusion is that this model is pretty weak in Dutch in general (at least at 5-bit quants and lower), but consistently worse in Bartowski's Q5_K_L than in unsloth's Q5_K_M. While unsloth's version makes some mistakes with the capitalization of the letter "IJ" (this is one letter in Dutch; I know, it's weird) and sometimes picks an incorrect article, Bartowski's version just makes up words or mixes in a lot of English words and constructions. Maybe this model is just very sensitive to differences in quantization methods.

I've also compared with unsloth's Q3_K_M quant to see if it breaks down, and that's indeed the case: at that point it starts making up words too. So I guess the conclusion is that this model's Dutch is weak and suffers greatly from quantization, and around 5 bits, minor differences between quantization methods can have a large effect.

Case closed, I guess!

martijk1 changed discussion status to closed
