Context Length
Is there a way to increase the context length for this model to 128k, like the unsloth quants? I am using ik_llama.
I haven't tried it myself, but I'd suggest trying:
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
This is from the original Qwen model card here: https://huggingface.co/Qwen/Qwen3-30B-A3B#processing-long-texts, which has more details. Keep in mind the model card also warns that this can hurt performance at shorter context lengths.
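Note you'd also need to raise the server's context size itself to actually get 128k. I haven't tested this on ik_llama, but with mainline llama.cpp flags it would look something like this (the ... standing in for your usual options):
llama-server ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768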
I don't think there is anything special about the unsloth quant, except that they may have added these parameters by default into the GGUF KV metadata, and possibly used a different strategy for imatrix calibration, though I've not seen the methodology documented in a repeatable way myself. I would love to see those details if anyone has a link!
If this doesn't work on ik's fork, it might be possible to just pass in some --override-kv overrides to achieve the same result; a rough guess at that is below. Let me know if you figure it out, otherwise I might look into it eventually. Cheers!
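For reference, assuming ik_llama accepts the same --override-kv KEY=TYPE:VALUE syntax as mainline llama.cpp and that the GGUF arch prefix for this model is qwen3moe (I haven't verified either), the overrides might look something like:
llama-server ... -c 131072 --override-kv qwen3moe.rope.scaling.type=str:yarn --override-kv qwen3moe.rope.scaling.factor=float:4.0 --override-kv qwen3moe.rope.scaling.original_context_length=int:32768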
128k would be really great! If I can utilize 128k on Qwen3, then I'll quit using ktransformers + DeepSeek V3 with 32k on q4km. I do have AMX + 512GB RAM + 2x4090.
Something for me to try out after the work day today!
Keep in mind the original Qwen models leave YaRN off by default, and you should only enable as much additional context as you actually need for your specific use case.
If your prompts are always under 32k, the model card suggests not enabling it at all:
All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set factor as 2.0.
This applies regardless of which quant you use.
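To make the quoted example concrete: if your longest prompts top out around 64k, you would halve the factor. Untested, but with the same flags as above it would be something like:
llama-server ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768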
If you have enough VRAM to run 128k but your use case is mostly prompts under 32k, you could instead go with --parallel 4 so each slot gets 32k and there's no need for YaRN at all. Then just keep the slots full for higher aggregate throughput than running one request at a time; see the command below.
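Roughly, something like this (the total context set by -c is split evenly across the slots, so four slots of 32k here):
llama-server ... -c 131072 --parallel 4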
How you want to run it really depends on your use case, but you have a lot of options.