Any chance of creating these with RoPE/YaRN for a context size larger than 32k?
Unsloth did this with their UD quants up to 128k, which was really useful: it meant you could run their GGUFs directly in Ollama, and in llama.cpp, without having to force an override of the RoPE settings in the server.
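For reference, this is roughly what the manual override looks like today when the 32k metadata is baked in. Just a sketch: the quant filename is hypothetical, the factor 4 / 32768 values follow Qwen3's suggested YaRN setup for 131072 context as I understand it, and you should verify the flags against `llama-server --help` for your build.

```python
# Sketch: serving a 32k-native quant at 128k by forcing YaRN on the command line.
# Flag names are llama.cpp's rope/YaRN options as I understand them.
import shlex

cmd = [
    "llama-server",
    "-m", "Qwen3-32B-Q4_K_M.gguf",   # hypothetical quant filename
    "-c", "131072",                  # desired context window
    "--rope-scaling", "yarn",        # force YaRN instead of the embedded setting
    "--rope-scale", "4",             # 131072 / 32768
    "--yarn-orig-ctx", "32768",      # the model's native training context
]
print(shlex.join(cmd))  # e.g. hand to subprocess.Popen(cmd) to actually start it
```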
Also, fyi the embedded chat template seems broken in these quants:
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected value expression at row 18, column 30:
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
^
{%- set index = (messages|length - 1) - loop.index0 %}
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
{{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
*Edit: The template issue could be related to this: https://github.com/ggml-org/llama.cpp/issues/13178#issuecomment-2839416968
The missing RoPE/YaRN is related to https://github.com/ggml-org/llama.cpp/pull/13331 and requires us to requant the model. It was not yet implemented back when we originally quantized the model, but I have since updated our llama.cpp fork. @mradermacher Let's update the workers and then requant this model, and while we're at it maybe we can also retry Qwen3-30B-A3B and Qwen3-30B-A3B-Base to see if those issues are fixed (which I don't think they are, but worth a try).
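Once requanted, here is a quick way to verify that a quant actually picked up the YaRN metadata. Sketch only: it assumes the `gguf` Python package from llama.cpp's gguf-py, a hypothetical filename, and the usual `<arch>.rope.scaling.*` / `<arch>.context_length` key names from the GGUF spec.

```python
# Sketch: dump the rope/context metadata of a GGUF to see whether the requant
# carries the YaRN keys (e.g. qwen3.rope.scaling.type = "yarn").
# Assumes `pip install gguf`; the filename is hypothetical.
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("Qwen3-32B-Q4_K_M.gguf")
for name, field in reader.fields.items():
    if ".rope." not in name and not name.endswith(".context_length"):
        continue
    if field.types and field.types[0] == GGUFValueType.STRING:
        value = bytes(field.parts[-1]).decode("utf-8")  # e.g. "yarn"
    else:
        value = field.parts[-1][0]                      # numeric scalar
    print(f"{name} = {value}")
```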
The broken chat template is a well-known issue and support for it will be implemented in llama.cpp soon. Once that is done it will just work, without us having to requant. As far as I'm aware, it's just that some jinja functions, like splitting, are still missing.
Note that llama.cpp supports jinja, but you have to enable it manually. It defaults to using minja, which doesn't even attempt to support all jinja features (but will support the ones needed here).
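For what it's worth, the constructs in the failing excerpt above (the `namespace(...)` call and the `messages[::-1]` slice) do parse and render under the reference Jinja2 engine, which fits the reading that only the bundled parser is behind. A quick sketch, assuming `pip install jinja2`:

```python
# Sketch: feed the constructs from the failing template excerpt to the
# reference Jinja2 engine to confirm they parse and render there.
from jinja2 import Environment

TEMPLATE = """
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
{{ index }}: {{ message.role }}
{%- endfor %}
"""

messages = [{"role": "system"}, {"role": "user"}, {"role": "assistant"}]
print(Environment().from_string(TEMPLATE).render(messages=messages))
# Prints the messages newest-first with their original indices:
# 2: assistant, 1: user, 0: system
```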
@nicoboss I don't see how that would fix the weights problem (it seems to be moe-specific). llama.cpp should be updated in a short while, though, and resuming the jobs is easy. Do we need to redo the imatrix file as well?
Hmm, and the patch affects qwen2, too.