Any chance of creating these with RoPE/YaRN for a context size larger than 32k?
Unsloth did this with their UD quants up to 128k, which was really useful: it meant you could run their GGUFs directly in Ollama, and in llama.cpp, without having to force an override of the RoPE settings in the server.
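For reference, this is roughly what the manual override looks like today when the 32k metadata is baked in. Just a sketch: the quant filename is hypothetical, the factor 4 / 32768 values follow Qwen3's suggested YaRN setup for 131072 context as I understand it, and you should verify the flags against `llama-server --help` for your build.

```python
# Sketch: serving a 32k-native quant at 128k by forcing YaRN on the command line.
# Flag names are llama.cpp's rope/YaRN options as I understand them.
import shlex

cmd = [
    "llama-server",
    "-m", "Qwen3-32B-Q4_K_M.gguf",   # hypothetical quant filename
    "-c", "131072",                  # desired context window
    "--rope-scaling", "yarn",        # force YaRN instead of the embedded setting
    "--rope-scale", "4",             # 131072 / 32768
    "--yarn-orig-ctx", "32768",      # the model's native training context
]
print(shlex.join(cmd))  # e.g. hand to subprocess.Popen(cmd) to actually start it
```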
Also, fyi the embedded chat template seems broken in these quants:
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected value expression at row 18, column 30:
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
^
{%- set index = (messages|length - 1) - loop.index0 %}
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
{{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
*Edit: The template issue could be related to this: https://github.com/ggml-org/llama.cpp/issues/13178#issuecomment-2839416968
The missing RoPE/YaRN is related to https://github.com/ggml-org/llama.cpp/pull/13331 and requires us to requant the model. It was not yet implemented back when we originally quantized the model, but I have since updated our llama.cpp fork. @mradermacher Let's update the workers and then requant this model, and while we're at it maybe we can also retry Qwen3-30B-A3B and Qwen3-30B-A3B-Base to see if those issues are fixed (which I don't think they are, but worth a try).
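Once requanted, here is a quick way to verify that a quant actually picked up the YaRN metadata. Sketch only: it assumes the `gguf` Python package from llama.cpp's gguf-py, a hypothetical filename, and the usual `<arch>.rope.scaling.*` / `<arch>.context_length` key names from the GGUF spec.

```python
# Sketch: dump the rope/context metadata of a GGUF to see whether the requant
# carries the YaRN keys (e.g. qwen3.rope.scaling.type = "yarn").
# Assumes `pip install gguf`; the filename is hypothetical.
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("Qwen3-32B-Q4_K_M.gguf")
for name, field in reader.fields.items():
    if ".rope." not in name and not name.endswith(".context_length"):
        continue
    if field.types and field.types[0] == GGUFValueType.STRING:
        value = bytes(field.parts[-1]).decode("utf-8")  # e.g. "yarn"
    else:
        value = field.parts[-1][0]                      # numeric scalar
    print(f"{name} = {value}")
```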
The broken chat template is a well-known issue and support for it will be implemented in llama.cpp soon. Once that is done it will just work, without us having to requant. As far as I'm aware, it's just that some jinja functions, like splitting, are still missing.
Note that llama.cpp supports jinja, but you have to enable it manually. It defaults to using minja, which doesn't even attempt to support all jinja features (but will support the ones needed here).
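For what it's worth, the constructs in the failing excerpt above (the `namespace(...)` call and the `messages[::-1]` slice) do parse and render under the reference Jinja2 engine, which fits the reading that only the bundled parser is behind. A quick sketch, assuming `pip install jinja2`:

```python
# Sketch: feed the constructs from the failing template excerpt to the
# reference Jinja2 engine to confirm they parse and render there.
from jinja2 import Environment

TEMPLATE = """
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
{{ index }}: {{ message.role }}
{%- endfor %}
"""

messages = [{"role": "system"}, {"role": "user"}, {"role": "assistant"}]
print(Environment().from_string(TEMPLATE).render(messages=messages))
# Prints the messages newest-first with their original indices:
# 2: assistant, 1: user, 0: system
```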
@nicoboss I don't see how that would fix the weights problem (it seems to be moe-specific). llama.cpp should be updated in a short while, though, and resuming the jobs is easy. Do we need to redo the imatrix file as well?
Hmm, and the patch affects qwen2, too.