Context Length
Is there a way to increase the context length for this model to 128k, like the unsloth quants? I am using ik_llama.
I haven't tried it myself, but I'd suggest trying:
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
This is from the original Qwen model card here: https://huggingface.co/Qwen/Qwen3-30B-A3B#processing-long-texts, which has more details. Keep in mind the model card also warns that this can hurt performance at shorter context lengths.
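Note you'd also need to raise the server's context size itself to actually get 128k. I haven't tested this on ik_llama, but with mainline llama.cpp flags it would look something like this (the ... standing in for your usual options):
llama-server ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768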
I don't think there is anything special about the unsloth quant, except that they may have added these parameters by default into the GGUF KV metadata, and possibly used a different strategy for imatrix calibration, though I've not seen the methodology documented in a repeatable way myself. I would love to see those details if anyone has a link!
If this doesn't work on ik's fork, it might be possible to just pass in some --override-kv overrides to achieve the same result; a rough guess at that is below. Let me know if you figure it out, otherwise I might look into it eventually. Cheers!
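For reference, assuming ik_llama accepts the same --override-kv KEY=TYPE:VALUE syntax as mainline llama.cpp and that the GGUF arch prefix for this model is qwen3moe (I haven't verified either), the overrides might look something like:
llama-server ... -c 131072 --override-kv qwen3moe.rope.scaling.type=str:yarn --override-kv qwen3moe.rope.scaling.factor=float:4.0 --override-kv qwen3moe.rope.scaling.original_context_length=int:32768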
128k would be really great! If I can utilize 128k on Qwen3, then I'll quit using ktransformers + DeepSeek V3 with 32k on q4km. I do have AMX + 512GB RAM + 2x4090.
Something for me to try out after the work day today!
Keep in mind the original Qwen models leave YaRN off by default, and you should only enable as much additional context as you actually need for your specific use case.
If your prompts are always under 32k, the model card suggests not enabling it at all:
All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set factor as 2.0.
This applies regardless of which quant you use.
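To make the quoted example concrete: if your longest prompts top out around 64k, you would halve the factor. Untested, but with the same flags as above it would be something like:
llama-server ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768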
If you have enough VRAM to run 128k but your use case is mostly prompts under 32k, you could instead go with --parallel 4 so each slot gets 32k and there's no need for YaRN at all. Then just keep the slots full for higher aggregate throughput than running one request at a time; see the command below.
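Roughly, something like this (the total context set by -c is split evenly across the slots, so four slots of 32k here):
llama-server ... -c 131072 --parallel 4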
How you want to run it really depends on your use case, but you have a lot of options.