Updated Title: UD-Q4_K_XL - Great Rust coder
Hi team,
- Ubuntu 24.04.2 LTS
- RTX 5090 (driver 575.64.03)
- Cline
- Ollama 0.10.1
- Flash Attention on
- KV Cache: FP16 (140k context) or Q8_0 (240k context), set as shown below
- UD-Q4_K_XL
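For anyone replicating this, these are roughly the server settings involved; the env var names are Ollama's documented ones (note that KV cache quantization requires flash attention to be enabled):

```
# Ollama server environment (e.g. systemd override or docker -e)
OLLAMA_FLASH_ATTENTION=1     # flash attention on
OLLAMA_KV_CACHE_TYPE=q8_0    # q8_0 for the 240k runs, f16 for the 140k runs
```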
The model is really good at coding & debugging for the first 40k tokens. Above that it starts to default to "simple solutions" when it can't produce a fix straight away, stripping out essential code (in most cases without breaking it).
As soon as it reaches 100k, Cline starts to hang a lot between requests, to the point that it times out my 30s trigger. Despite sustaining around 140 t/s, the model never gets close to the max context. The behavior is the same whether I max out with FP16 @ 140k or Q8_0 @ 240k, or even reduce the context to 90-100k.
I've tried using Roo Code, Continue.dev, and the Qwen-Code CLI, but none of them work due to bad tool calls. I haven't tried exhausting the context with Ollama alone yet.
I will try a different quant, and I'm also going to try the 1-million-token context model restricted to 240k to see if it's worth anything.
If you need more details, give a shout.
Did you find any improvements using different quants?
Yes.
I tried a similar Q5_K_XL quant from a different provider, and the behavior was the same with both Q8 and FP16 KV cache.
Then I tested Unsloth's 1-million-token context model at UD-Q4_K_XL with a Q8 KV cache, and it was able to get through 160k of context. I didn't go higher, as my work was completed by then.
However, it got significantly slower as it climbed past 100k, to the point that it would theoretically have been quicker to restart the session with fresh context. With my VRAM I could maybe extend the context up to 500k @ Q4 KV cache, but I don't think Cline would ever get there.
Thank you for your response! Could you please also paste the command parameters for starting it with increased context? Is a RoPE setting needed?
As stated in the model card, the 256k context window is native. The 1-million-token version is another GGUF from Unsloth, different from the one we're discussing here; they ship it preconfigured to run at that maximum, so no manual RoPE setting is needed.
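For the context size itself, here's a minimal sketch of how you'd raise it in Ollama; the FROM tag is a placeholder for whichever GGUF you pulled, and num_ctx is Ollama's standard context-length parameter:

```
# Modelfile (the FROM tag is a placeholder; point it at your pulled GGUF)
FROM my-coder-model:latest
PARAMETER num_ctx 262144
```

Then build and run it with `ollama create my-coder-256k -f Modelfile` followed by `ollama run my-coder-256k`.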
Update.
I misled myself by using nvidia-smi to check the total load up front. UD-Q4_K_XL doesn't actually fit a 240k context window in 30 GB of VRAM with a Q8_0 KV cache.
Roughly 60k of context actually fits in VRAM before it starts offloading to the CPU. I was under the impression the model was losing performance above that number, when it was actually spilling into RAM without showing up in htop, just heavier CPU usage once we crossed the ~60k threshold. The ghost RAM usage was only visible through the Ollama container's RAM figure, as it was allocated up front without showing through the usual tools.
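To sanity-check where a ceiling like that comes from, here's a back-of-the-envelope KV cache sizing sketch in Rust. The layer/head/dim numbers are placeholder assumptions, not the real model's config, and q8_0 is taken as ~1.0625 bytes per element (34 bytes per 32-element block):

```rust
// Rough KV-cache sizing: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes/elem.
fn kv_cache_bytes(ctx: u64, layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: f64) -> f64 {
    2.0 * (ctx * layers * kv_heads * head_dim) as f64 * bytes_per_elem
}

fn main() {
    let gib = 1024f64.powi(3);
    // Placeholder shape; substitute the values from the model's config.json.
    let (layers, kv_heads, head_dim) = (48, 4, 128);
    for (label, bpe) in [("fp16", 2.0), ("q8_0", 1.0625)] {
        for ctx in [60_000u64, 140_000, 240_000] {
            let gb = kv_cache_bytes(ctx, layers, kv_heads, head_dim, bpe) / gib;
            println!("{label:>5} @ {ctx:>7} ctx = {gb:.2} GiB KV cache");
        }
    }
}
```

Plugging in the real shape from the model's config makes it clear why only part of a nominal 240k window fits in VRAM alongside the weights.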
That being said, the model works fine up to 120k before it becomes too slow for coding. It performs best in Cline in my experience, and I'm a CLI fan. The checkpoints are massively useful.
Occasionally, above 60k, the model has a strange tendency to avoid fixing complex problems, or to delete the complex code outright, replacing it with comments or over-simplified snippets instead of finding a proper, corrective implementation. Not great for those vibe-coding to death. Supervision is necessary.
However, it has been the best model I've ever used for coding locally. I'm using it on a 10k+ line Rust codebase, with the least reliance on SOTA closed-source models I've had so far.
I can say UD-Q4_K_XL has been the better performer compared to an i1Q5_K_XL.
In my senior-level Rust test prompts, the Q8_0 KV cache performs as well as FP16. Going down to Q4_0 is not advisable, as output quality degrades significantly: any extra context window gained from Q4_0 is eaten up by the loops the model goes through to fix its own outputs.
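If anyone wants to reproduce the comparison, the cache type is a single server-side setting in Ollama (restart the server after changing it):

```
# Pick one per run; f16 is Ollama's default
OLLAMA_KV_CACHE_TYPE=f16    # baseline
OLLAMA_KV_CACHE_TYPE=q8_0   # roughly half the KV memory, quality held up in my tests
OLLAMA_KV_CACHE_TYPE=q4_0   # roughly a quarter of the memory, noticeable quality loss
```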