Tool calling error
500: Value is not callable: null at row 62, column 114:
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
^
{%- if param_fields[json_key] is mapping %}
at row 62, column 21:
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
^
{%- if param_fields[json_key] is mapping %}
at row 61, column 55:
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
^
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
at row 61, column 17:
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
^
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
at row 60, column 48:
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
^
{%- if json_key not in handled_keys %}
at row 60, column 13:
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
^
{%- if json_key not in handled_keys %}
at row 49, column 80:
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
^
{{- '\n<parameter>' }}
at row 49, column 9:
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
^
{{- '\n<parameter>' }}
at row 42, column 29:
{{- "" }}
{%- for tool in tools %}
^
{%- if tool.function is defined %}
at row 42, column 5:
{{- "" }}
{%- for tool in tools %}
^
{%- if tool.function is defined %}
at row 39, column 51:
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
^
{{- "\n\nYou have access to the following functions:\n\n" }}
at row 39, column 1:
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
^
{{- "\n\nYou have access to the following functions:\n\n" }}
at row 1, column 69:
{#- Copyright 2025-present the Unsloth team. All rights reserved. #}
^
{#- Licensed under the Apache License, Version 2.0 (the "License") #}
Could you provide an example where this is failing - that would be very helpful, thank you!
Hello. I am trying to use the model with my MCP server via Open WebUI. llama.cpp outputs this error on tool calls:
main: server is listening on http://0.0.0.0:8089 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /v1/models 192.168.4.78 200
got exception: {"code":500,"message":"Value is not callable: null at row 62, column 114:\n {%- if json_key not in handled_keys %}\n {%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}\n ^\n {%- if param_fields[json_key] is mapping %}\n at row 62, column 21:\n {%- if json_key not in handled_keys %}\n {%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}\n ^\n {%- if param_fields[json_key] is mapping %}\n at row 61, column 55:\n {%- for json_key in param_fields %}\n {%- if json_key not in handled_keys %}\n ^\n {%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}\n at row 61, column 17:\n {%- for json_key in param_fields %}\n {%- if json_key not in handled_keys %}\n ^\n {%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}\n at row 60, column 48:\n {%- set handled_keys = ['type', 'description', 'enum', 'required'] %}\n {%- for json_key in param_fields %}\n ^\n {%- if json_key not in handled_keys %}\n at row 60, column 13:\n {%- set handled_keys = ['type', 'description', 'enum', 'required'] %}\n {%- for json_key in param_fields %}\n ^\n {%- if json_key not in handled_keys %}\n at row 49, column 80:\n {{- '\n<parameters>' }}\n {%- for param_name, param_fields in tool.parameters.properties|items %}\n ^\n {{- '\n<parameter>' }}\n at row 49, column 9:\n {{- '\n<parameters>' }}\n {%- for param_name, param_fields in tool.parameters.properties|items %}\n ^\n {{- '\n<parameter>' }}\n at row 42, column 29:\n {{- "<tools>" }}\n {%- for tool in tools %}\n ^\n {%- if tool.function is defined %}\n at row 42, column 5:\n {{- "<tools>" }}\n {%- for tool in tools %}\n ^\n {%- if tool.function is defined %}\n at row 39, column 51:\n{%- endif %}\n{%- if tools is iterable and tools | length > 0 %}\n ^\n {{- "\n\nYou have access to the following functions:\n\n" }}\n at row 39, column 1:\n{%- endif %}\n{%- if tools is iterable and tools | length > 0 %}\n^\n {{- "\n\nYou have access to the following functions:\n\n" }}\n at row 1, column 69:\n{#- Copyright 2025-present the Unsloth team. All rights reserved. #}\n ^\n{#- Licensed under the Apache License, Version 2.0 (the "License") #}\n","type":"server_error"}
srv log_server_r: request: POST /v1/chat/completions 192.168.4.78 500
Is there some other debugging information that would be useful?
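In case a minimal reproduction is useful: any request whose tools array contains at least one function parameter should exercise the same loop that the traceback points at. A sketch along these lines (host, port and model name are placeholders; the tool shape is the standard OpenAI function schema these clients send):

curl http://localhost:8089/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Look up a fun fact"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "search_web",
        "description": "Performs a web search, returning top results.",
        "parameters": {
          "type": "object",
          "required": ["query"],
          "properties": {
            "query": {"type": "string", "description": "The natural language search query"}
          }
        }
      }
    }]
  }'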
I'm getting the same error with the UD-Q5_K_XL quant and llama.cpp (using --jinja and the other flags from the docs), and I'm using Qwen CLI:
[API Error: OpenAI API error: 500 Value is not callable: null at row 62, column 114:
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
^
{%- if param_fields[json_key] is mapping %}
at row 62, column 21:
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
^
{%- if param_fields[json_key] is mapping %}
at row 61, column 55:
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
^
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
at row 61, column 17:
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
^
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
at row 60, column 48:
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
^
{%- if json_key not in handled_keys %}
at row 60, column 13:
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
^
{%- if json_key not in handled_keys %}
at row 49, column 80:
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
^
{{- '\n<parameter>' }}
at row 49, column 9:
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
^
{{- '\n<parameter>' }}
at row 42, column 29:
{{- "<tools>" }}
{%- for tool in tools %}
^
{%- if tool.function is defined %}
at row 42, column 5:
{{- "<tools>" }}
{%- for tool in tools %}
^
{%- if tool.function is defined %}
at row 39, column 51:
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
^
{{- "\n\nYou have access to the following functions:\n\n" }}
at row 39, column 1:
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
^
{{- "\n\nYou have access to the following functions:\n\n" }}
at row 1, column 69:
{#- Copyright 2025-present the Unsloth team. All rights reserved. #}
^
{#- Licensed under the Apache License, Version 2.0 (the "License") #}
]
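One other check that might help when comparing setups: llama-server exposes the chat template it actually loaded via its /props endpoint, so you can confirm that the GGUF's embedded (Unsloth-edited) template is the one being rendered. Roughly (port is a placeholder; the exact response fields may differ between builds):

curl -s http://localhost:8080/props | jq -r '.chat_template'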
Reproducibility steps below, happy to provide any more info or detail!
- Use Continue within VS Code
- Serve the UD-Q4_K_XL quant with llama.cpp (I used Ramalama to execute it within a container, using Vulkan in this example)
- Logs:
❯ ramalama --debug serve --image quay.io/ramalama/ramalama:latest -c 50000 --temp 0.7 --runtime-args="--top-k 20 --top-p 0.8 --frequency-penalty 1.05" hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
2025-07-31 12:36:22 - DEBUG - run_cmd: podman inspect quay.io/ramalama/rocm:0.11
2025-07-31 12:36:22 - DEBUG - Working directory: None
2025-07-31 12:36:22 - DEBUG - Ignore stderr: False
2025-07-31 12:36:22 - DEBUG - Ignore all: True
2025-07-31 12:36:22 - DEBUG - Checking if 8080 is available
2025-07-31 12:36:22 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --device /dev/dri --device /dev/kfd --device /dev/accel -e HIP_VISIBLE_DEVICES=0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer --label ai.ramalama --name ramalama_O3d5RTPz47 --env=HOME=/tmp --init --mount=type=bind,src=/var/home/kush/.local/share/ramalama/store/huggingface/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf/blobs/sha256-89d766f4653c43105922c15bcb5ceec053990f571e94d8535f9dd7098a15ba4c,destination=/mnt/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf,ro quay.io/ramalama/ramalama:latest llama-server --port 8080 --model /mnt/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf --no-warmup --jinja --log-colors --alias unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf --ctx-size 50000 --temp 0.7 --cache-reuse 256 --top-k 20 --top-p 0.8 --frequency-penalty 1.05 -v -ngl 999 --threads 16 --host 0.0.0.0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 5985 (3f4fc97f) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 31
main: loading model
srv load_model: loading model '/mnt/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 64997 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 579 tensors from /mnt/models/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 28: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 15
llama_model_loader: - kv 40: quantize.imatrix.file str = Qwen3-Coder-30B-A3B-Instruct-GGUF/ima...
llama_model_loader: - kv 41: quantize.imatrix.entries_count u32 = 383
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 292 tensors
llama_model_loader: - type q5_K: 35 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 16.45 GiB (4.63 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
...
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 16674.36 MiB
load_tensors: CPU_Mapped model buffer size = 166.92 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 50000
llama_context: n_ctx_per_seq = 50000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (50000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host output buffer size = 0.58 MiB
create_memory: n_ctx = 50016 (padded)
llama_kv_cache_unified: layer 0: dev = Vulkan0
...
llama_kv_cache_unified: Vulkan0 KV buffer size = 4689.00 MiB
llama_kv_cache_unified: size = 4689.00 MiB ( 50016 cells, 48 layers, 1/ 1 seqs), K (f16): 2344.50 MiB, V (f16): 2344.50 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 4632
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: Vulkan0 compute buffer size = 3247.69 MiB
llama_context: Vulkan_Host compute buffer size = 101.70 MiB
llama_context: graph nodes = 3270
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 50016
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 50016
slot reset: id 0 | task -1 |
main: model loaded
main: chat template, chat_template: {#- Copyright 2025-present the Unsloth team. All rights reserved. #}
{#- Licensed under the Apache License, Version 2.0 (the "License") #}
{#- Edits made by Unsloth to fix the chat template #}
{% macro render_item_list(item_list, tag_name='required') %}
{%- if item_list is defined and item_list is iterable and item_list | length > 0 %}
{%- if tag_name %}{{- '\n<' ~ tag_name ~ '>' -}}{% endif %}
{{- '[' }}
{%- for item in item_list -%}
{%- if loop.index > 1 %}{{- ", "}}{% endif -%}
{%- if item is string -%}
{{ "`" ~ item ~ "`" }}
{%- else -%}
{{ item }}
{%- endif -%}
{%- endfor -%}
{{- ']' }}
{%- if tag_name %}{{- '</' ~ tag_name ~ '>' -}}{% endif %}
{%- endif %}
{% endmacro %}
{%- if messages[0]["role"] == "system" %}
{%- set system_message = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = [] %}
{%- endif %}
{%- if system_message is defined %}
{{- "<|im_start|>system\n" + system_message }}
{%- else %}
{%- if tools is iterable and tools | length > 0 %}
{{- "<|im_start|>system\nYou are Qwen, a helpful AI assistant that can interact with a computer to solve tasks." }}
{%- endif %}
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
{{- "\n\nYou have access to the following functions:\n\n" }}
{{- "<tools>" }}
{%- for tool in tools %}
{%- if tool.function is defined %}
{%- set tool = tool.function %}
{%- endif %}
{{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
{{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
{{- '\n<parameter>' }}
{{- '\n<name>' ~ param_name ~ '</name>' }}
{%- if param_fields.type is defined %}
{{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
{%- endif %}
{%- if param_fields.description is defined %}
{{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
{%- endif %}
{{- render_item_list(param_fields.enum, 'enum') }}
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
{%- if param_fields[json_key] is mapping %}
{{- '\n<' ~ normed_json_key ~ '>' ~ (param_fields[json_key] | tojson | safe) ~ '</' ~ normed_json_key ~ '>' }}
{%- else %}
{{- '\n<' ~ normed_json_key ~ '>' ~ (param_fields[json_key] | string) ~ '</' ~ normed_json_key ~ '>' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{{- render_item_list(param_fields.required, 'required') }}
{{- '\n</parameter>' }}
{%- endfor %}
{{- render_item_list(tool.parameters.required, 'required') }}
{{- '\n</parameters>' }}
{%- if tool.return is defined %}
{%- if tool.return is mapping %}
{{- '\n<return>' ~ (tool.return | tojson | safe) ~ '</return>' }}
{%- else %}
{{- '\n<return>' ~ (tool.return | string) ~ '</return>' }}
{%- endif %}
{%- endif %}
{{- '\n</function>' }}
{%- endfor %}
{{- "\n</tools>" }}
{{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
{%- endif %}
{%- if system_message is defined %}
{{- '<|im_end|>\n' }}
{%- else %}
{%- if tools is iterable and tools | length > 0 %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in loop_messages %}
{%- if message.role == "assistant" and message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
{{- '<|im_start|>' + message.role }}
{%- if message.content is defined and message.content is string and message.content | trim | length > 0 %}
{{- '\n' + message.content | trim + '\n' }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- if tool_call.arguments is defined %}
{%- for args_name, args_value in tool_call.arguments|items %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value if args_value is string else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
{{- '</function>\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "user" or message.role == "system" or message.role == "assistant" %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "tool" %}
{%- if loop.previtem and loop.previtem.role != "tool" %}
{{- '<|im_start|>user\n' }}
{%- endif %}
{{- '<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>\n' }}
{%- if not loop.last and loop.nextitem.role != "tool" %}
{{- '<|im_end|>\n' }}
{%- elif loop.last %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
{#- Copyright 2025-present the Unsloth team. All rights reserved. #}
{#- Licensed under the Apache License, Version 2.0 (the "License") #}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
que start_loop: processing new tasks
que start_loop: update slots
srv update_slots: all slots are idle
srv kv_cache_cle: clearing KV cache
que start_loop: waiting for new tasks
- Try to have the agent mode in Continue use the built-in tool calling:
- Logs:
request: {
"messages": [
{
"role": "system",
"content": "<important_rules>\n You are in agent mode.\n\n Always include the language and file name in the info string when you write code blocks.\n If you are editing \"src/main.py\" for example, your code block should start with '```python src/main.py'\n\n</important_rules>"
},
{
"role": "user",
"content": "Use the web search tool to look up some fun facts"
}
],
"model": "default",
"max_tokens": 4096,
"stream": true,
"tools": [
{
"type": "function",
"function": {
"name": "read_file",
"description": "Use this tool if you need to view the contents of an existing file.",
"parameters": {
"type": "object",
"required": [
"filepath"
],
"properties": {
"filepath": {
"type": "string",
"description": "The path of the file to read, relative to the root of the workspace (NOT uri or absolute path)"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "create_new_file",
"description": "Create a new file. Only use this when a file doesn't exist and should be created",
"parameters": {
"type": "object",
"required": [
"filepath",
"contents"
],
"properties": {
"filepath": {
"type": "string",
"description": "The path where the new file should be created, relative to the root of the workspace"
},
"contents": {
"type": "string",
"description": "The contents to write to the new file"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "run_terminal_command",
"description": "Run a terminal command in the current directory.\nThe shell is not stateful and will not remember any previous commands. When a command is run in the background ALWAYS suggest using shell commands to stop it; NEVER suggest using Ctrl+C. When suggesting subsequent shell commands ALWAYS format them in shell command blocks. Do NOT perform actions requiring special/admin privileges. Choose terminal commands and scripts optimized for darwin and arm64 and shell /bin/zsh.",
"parameters": {
"type": "object",
"required": [
"command"
],
"properties": {
"command": {
"type": "string",
"description": "The command to run. This will be passed directly into the IDE shell."
},
"waitForCompletion": {
"type": "boolean",
"description": "Whether to wait for the command to complete before returning. Default is true. Set to false to run the command in the background. Set to true to run the command in the foreground and wait to collect the output."
}
}
}
}
},
{
"type": "function",
"function": {
"name": "file_glob_search",
"description": "Search for files recursively in the project using glob patterns. Supports ** for recursive directory search. Output may be truncated; use targeted patterns",
"parameters": {
"type": "object",
"required": [
"pattern"
],
"properties": {
"pattern": {
"type": "string",
"description": "Glob pattern for file path matching"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "search_web",
"description": "Performs a web search, returning top results. Use this tool sparingly - only for questions that require specialized, external, and/or up-to-date knowledege. Common programming questions do not require web search.",
"parameters": {
"type": "object",
"required": [
"query"
],
"properties": {
"query": {
"type": "string",
"description": "The natural language search query"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "view_diff",
"description": "View the current diff of working changes",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "read_currently_open_file",
"description": "Read the currently open file in the IDE. If the user seems to be referring to a file that you can't see, try using this",
"parameters": {
"type": "object",
"properties": {}
}
}
},
{
"type": "function",
"function": {
"name": "ls",
"description": "List files and folders in a given directory",
"parameters": {
"type": "object",
"properties": {
"dirPath": {
"type": "string",
"description": "The directory path relative to the root of the project. Use forward slash paths like '/'. rather than e.g. '.'"
},
"recursive": {
"type": "boolean",
"description": "If true, lists files and folders recursively. To prevent unexpected large results, use this sparingly"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "create_rule_block",
"description": "Creates a \"rule\" that can be referenced in future conversations. This should be used whenever you want to establish code standards / preferences that should be applied consistently, or when you want to avoid making a mistake again. To modify existing rules, use the edit tool instead.\n\nRule Types:\n- Always: Include only \"rule\" (always included in model context)\n- Auto Attached: Include \"rule\", \"globs\", and/or \"regex\" (included when files match patterns)\n- Agent Requested: Include \"rule\" and \"description\" (AI decides when to apply based on description)\n- Manual: Include only \"rule\" (only included when explicitly mentioned using @ruleName)",
"parameters": {
"type": "object",
"required": [
"name",
"rule"
],
"properties": {
"name": {
"type": "string",
"description": "Short, descriptive name summarizing the rule's purpose (e.g. 'React Standards', 'Type Hints')"
},
"rule": {
"type": "string",
"description": "Clear, imperative instruction for future code generation (e.g. 'Use named exports', 'Add Python type hints'). Each rule should focus on one specific standard."
},
"description": {
"type": "string",
"description": "Description of when this rule should be applied. Required for Agent Requested rules (AI decides when to apply). Optional for other types."
},
"globs": {
"type": "string",
"description": "Optional file patterns to which this rule applies (e.g. ['**/*.{ts,tsx}'] or ['src/**/*.ts', 'tests/**/*.ts'])"
},
"regex": {
"type": "string",
"description": "Optional regex patterns to match against file content. Rule applies only to files whose content matches the pattern (e.g. 'useEffect' for React hooks or '\\bclass\\b' for class definitions)"
},
"alwaysApply": {
"type": "boolean",
"description": "Whether this rule should always be applied. Set to false for Agent Requested and Manual rules. Omit or set to true for Always and Auto Attached rules."
}
}
}
}
},
{
"type": "function",
"function": {
"name": "fetch_url_content",
"description": "Can be used to view the contents of a website using a URL. Do NOT use this for files.",
"parameters": {
"type": "object",
"required": [
"url"
],
"properties": {
"url": {
"type": "string",
"description": "The URL to read"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "grep_search",
"description": "Perform a search over the repository using ripgrep. Output may be truncated, so use targeted queries",
"parameters": {
"type": "object",
"required": [
"query"
],
"properties": {
"query": {
"type": "string",
"description": "The search query to use. Must be a valid ripgrep regex expression, escaped where needed"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "request_rule",
"description": "Use this tool to retrieve additional 'rules' that contain more context/instructions based on their descriptions. Available rules:\nNo rules available.",
"parameters": {
"type": "object",
"required": [
"name"
],
"properties": {
"name": {
"type": "string",
"description": "Name of the rule"
}
}
}
}
},
{
"type": "function",
"function": {
"name": "edit_existing_file",
"description": "Use this tool to edit an existing file. If you don't know the contents of the file, read it first.\n When addressing code modification requests, present a concise code snippet that\n emphasizes only the necessary changes and uses abbreviated placeholders for\n unmodified sections. For example:\n\n ```language /path/to/file\n // ... existing code ...\n\n {{ modified code here }}\n\n // ... existing code ...\n\n {{ another modification }}\n\n // ... rest of code ...\n ```\n\n In existing files, you should always restate the function or class that the snippet belongs to:\n\n ```language /path/to/file\n // ... existing code ...\n\n function exampleFunction() {\n // ... existing code ...\n\n {{ modified code here }}\n\n // ... rest of function ...\n }\n\n // ... rest of code ...\n ```\n\n Since users have access to their complete file, they prefer reading only the\n relevant modifications. It's perfectly acceptable to omit unmodified portions\n at the beginning, middle, or end of files using these \"lazy\" comments. Only\n provide the complete file when explicitly requested. Include a concise explanation\n of changes unless the user specifically asks for code only.\n\nNote this tool CANNOT be called in parallel.",
"parameters": {
"type": "object",
"required": [
"filepath",
"changes"
],
"properties": {
"filepath": {
"type": "string",
"description": "The path of the file to edit, relative to the root of the workspace."
},
"changes": {
"type": "string",
"description": "Any modifications to the file, showing only needed changes. Do NOT wrap this in a codeblock or write anything besides the code changes. In larger files, use brief language-appropriate placeholders for large unmodified sections, e.g. '// ... existing code ...'"
}
}
}
}
}
],
"parallel_tool_calls": false
}
srv params_from_: Grammar: any-tool-call ::= ( read-file-call | create-new-file-call | run-terminal-command-call | file-glob-search-call | search-web-call | view-diff-call | read-currently-open-file-call | ls-call | create-rule-block-call | fetch-url-content-call | grep-search-call | request-rule-call | edit-existing-file-call ) space
boolean ::= ("true" | "false") space
char ::= [^"\\\x7F\x00-\x1F] | [\\] (["\\bfnrt] | "u" [0-9a-fA-F]{4})
create-new-file-args ::= "{" space create-new-file-args-filepath-kv "," space create-new-file-args-contents-kv "}" space
create-new-file-args-contents-kv ::= "\"contents\"" space ":" space string
create-new-file-args-filepath-kv ::= "\"filepath\"" space ":" space string
create-new-file-call ::= "{" space create-new-file-call-name-kv "," space create-new-file-call-arguments-kv "}" space
create-new-file-call-arguments ::= "{" space create-new-file-call-arguments-filepath-kv "," space create-new-file-call-arguments-contents-kv "}" space
create-new-file-call-arguments-contents-kv ::= "\"contents\"" space ":" space string
create-new-file-call-arguments-filepath-kv ::= "\"filepath\"" space ":" space string
create-new-file-call-arguments-kv ::= "\"arguments\"" space ":" space create-new-file-call-arguments
create-new-file-call-name ::= "\"create_new_file\"" space
create-new-file-call-name-kv ::= "\"name\"" space ":" space create-new-file-call-name
create-new-file-function-tag ::= "<function" ( "=create_new_file" | " name=\"create_new_file\"" ) ">" space create-new-file-args "</function>" space
create-rule-block-args ::= "{" space create-rule-block-args-name-kv "," space create-rule-block-args-rule-kv ( "," space ( create-rule-block-args-description-kv create-rule-block-args-description-rest | create-rule-block-args-globs-kv create-rule-block-args-globs-rest | create-rule-block-args-regex-kv create-rule-block-args-regex-rest | create-rule-block-args-alwaysApply-kv ) )? "}" space
create-rule-block-args-alwaysApply-kv ::= "\"alwaysApply\"" space ":" space boolean
create-rule-block-args-description-kv ::= "\"description\"" space ":" space string
create-rule-block-args-description-rest ::= ( "," space create-rule-block-args-globs-kv )? create-rule-block-args-globs-rest
create-rule-block-args-globs-kv ::= "\"globs\"" space ":" space string
create-rule-block-args-globs-rest ::= ( "," space create-rule-block-args-regex-kv )? create-rule-block-args-regex-rest
create-rule-block-args-name-kv ::= "\"name\"" space ":" space string
create-rule-block-args-regex-kv ::= "\"regex\"" space ":" space string
create-rule-block-args-regex-rest ::= ( "," space create-rule-block-args-alwaysApply-kv )?
create-rule-block-args-rule-kv ::= "\"rule\"" space ":" space string
create-rule-block-call ::= "{" space create-rule-block-call-name-kv "," space create-rule-block-call-arguments-kv "}" space
create-rule-block-call-arguments ::= "{" space create-rule-block-call-arguments-name-kv "," space create-rule-block-call-arguments-rule-kv ( "," space ( create-rule-block-call-arguments-description-kv create-rule-block-call-arguments-description-rest | create-rule-block-call-arguments-globs-kv create-rule-block-call-arguments-globs-rest | create-rule-block-call-arguments-regex-kv create-rule-block-call-arguments-regex-rest | create-rule-block-call-arguments-alwaysApply-kv ) )? "}" space
create-rule-block-call-arguments-alwaysApply-kv ::= "\"alwaysApply\"" space ":" space boolean
create-rule-block-call-arguments-description-kv ::= "\"description\"" space ":" space string
create-rule-block-call-arguments-description-rest ::= ( "," space create-rule-block-call-arguments-globs-kv )? create-rule-block-call-arguments-globs-rest
create-rule-block-call-arguments-globs-kv ::= "\"globs\"" space ":" space string
create-rule-block-call-arguments-globs-rest ::= ( "," space create-rule-block-call-arguments-regex-kv )? create-rule-block-call-arguments-regex-rest
create-rule-block-call-arguments-kv ::= "\"arguments\"" space ":" space create-rule-block-call-arguments
create-rule-block-call-arguments-name-kv ::= "\"name\"" space ":" space string
create-rule-block-call-arguments-regex-kv ::= "\"regex\"" space ":" space string
create-rule-block-call-arguments-regex-rest ::= ( "," space create-rule-block-call-arguments-alwaysApply-kv )?
create-rule-block-call-arguments-rule-kv ::= "\"rule\"" space ":" space string
create-rule-block-call-name ::= "\"create_rule_block\"" space
create-rule-block-call-name-kv ::= "\"name\"" space ":" space create-rule-block-call-name
create-rule-block-function-tag ::= "<function" ( "=create_rule_block" | " name=\"create_rule_block\"" ) ">" space create-rule-block-args "</function>" space
edit-existing-file-args ::= "{" space edit-existing-file-args-filepath-kv "," space edit-existing-file-args-changes-kv "}" space
edit-existing-file-args-changes-kv ::= "\"changes\"" space ":" space string
edit-existing-file-args-filepath-kv ::= "\"filepath\"" space ":" space string
edit-existing-file-call ::= "{" space edit-existing-file-call-name-kv "," space edit-existing-file-call-arguments-kv "}" space
edit-existing-file-call-arguments ::= "{" space edit-existing-file-call-arguments-filepath-kv "," space edit-existing-file-call-arguments-changes-kv "}" space
edit-existing-file-call-arguments-changes-kv ::= "\"changes\"" space ":" space string
edit-existing-file-call-arguments-filepath-kv ::= "\"filepath\"" space ":" space string
edit-existing-file-call-arguments-kv ::= "\"arguments\"" space ":" space edit-existing-file-call-arguments
edit-existing-file-call-name ::= "\"edit_existing_file\"" space
edit-existing-file-call-name-kv ::= "\"name\"" space ":" space edit-existing-file-call-name
edit-existing-file-function-tag ::= "<function" ( "=edit_existing_file" | " name=\"edit_existing_file\"" ) ">" space edit-existing-file-args "</function>" space
fetch-url-content-args ::= "{" space fetch-url-content-args-url-kv "}" space
fetch-url-content-args-url-kv ::= "\"url\"" space ":" space string
fetch-url-content-call ::= "{" space fetch-url-content-call-name-kv "," space fetch-url-content-call-arguments-kv "}" space
fetch-url-content-call-arguments ::= "{" space fetch-url-content-call-arguments-url-kv "}" space
fetch-url-content-call-arguments-kv ::= "\"arguments\"" space ":" space fetch-url-content-call-arguments
fetch-url-content-call-arguments-url-kv ::= "\"url\"" space ":" space string
fetch-url-content-call-name ::= "\"fetch_url_content\"" space
fetch-url-content-call-name-kv ::= "\"name\"" space ":" space fetch-url-content-call-name
fetch-url-content-function-tag ::= "<function" ( "=fetch_url_content" | " name=\"fetch_url_content\"" ) ">" space fetch-url-content-args "</function>" space
file-glob-search-args ::= "{" space file-glob-search-args-pattern-kv "}" space
file-glob-search-args-pattern-kv ::= "\"pattern\"" space ":" space string
file-glob-search-call ::= "{" space file-glob-search-call-name-kv "," space file-glob-search-call-arguments-kv "}" space
file-glob-search-call-arguments ::= "{" space file-glob-search-call-arguments-pattern-kv "}" space
file-glob-search-call-arguments-kv ::= "\"arguments\"" space ":" space file-glob-search-call-arguments
file-glob-search-call-arguments-pattern-kv ::= "\"pattern\"" space ":" space string
file-glob-search-call-name ::= "\"file_glob_search\"" space
file-glob-search-call-name-kv ::= "\"name\"" space ":" space file-glob-search-call-name
file-glob-search-function-tag ::= "<function" ( "=file_glob_search" | " name=\"file_glob_search\"" ) ">" space file-glob-search-args "</function>" space
grep-search-args ::= "{" space grep-search-args-query-kv "}" space
grep-search-args-query-kv ::= "\"query\"" space ":" space string
grep-search-call ::= "{" space grep-search-call-name-kv "," space grep-search-call-arguments-kv "}" space
grep-search-call-arguments ::= "{" space grep-search-call-arguments-query-kv "}" space
grep-search-call-arguments-kv ::= "\"arguments\"" space ":" space grep-search-call-arguments
grep-search-call-arguments-query-kv ::= "\"query\"" space ":" space string
grep-search-call-name ::= "\"grep_search\"" space
grep-search-call-name-kv ::= "\"name\"" space ":" space grep-search-call-name
grep-search-function-tag ::= "<function" ( "=grep_search" | " name=\"grep_search\"" ) ">" space grep-search-args "</function>" space
ls-args ::= "{" space (ls-args-dirPath-kv ls-args-dirPath-rest | ls-args-recursive-kv )? "}" space
ls-args-dirPath-kv ::= "\"dirPath\"" space ":" space string
ls-args-dirPath-rest ::= ( "," space ls-args-recursive-kv )?
ls-args-recursive-kv ::= "\"recursive\"" space ":" space boolean
ls-call ::= "{" space ls-call-name-kv "," space ls-call-arguments-kv "}" space
ls-call-arguments ::= "{" space (ls-call-arguments-dirPath-kv ls-call-arguments-dirPath-rest | ls-call-arguments-recursive-kv )? "}" space
ls-call-arguments-dirPath-kv ::= "\"dirPath\"" space ":" space string
ls-call-arguments-dirPath-rest ::= ( "," space ls-call-arguments-recursive-kv )?
ls-call-arguments-kv ::= "\"arguments\"" space ":" space ls-call-arguments
ls-call-arguments-recursive-kv ::= "\"recursive\"" space ":" space boolean
ls-call-name ::= "\"ls\"" space
ls-call-name-kv ::= "\"name\"" space ":" space ls-call-name
ls-function-tag ::= "<function" ( "=ls" | " name=\"ls\"" ) ">" space ls-args "</function>" space
read-currently-open-file-args ::= "{" space "}" space
read-currently-open-file-call ::= "{" space read-currently-open-file-call-name-kv "," space read-currently-open-file-call-arguments-kv "}" space
read-currently-open-file-call-arguments ::= "{" space "}" space
read-currently-open-file-call-arguments-kv ::= "\"arguments\"" space ":" space read-currently-open-file-call-arguments
read-currently-open-file-call-name ::= "\"read_currently_open_file\"" space
read-currently-open-file-call-name-kv ::= "\"name\"" space ":" space read-currently-open-file-call-name
read-currently-open-file-function-tag ::= "<function" ( "=read_currently_open_file" | " name=\"read_currently_open_file\"" ) ">" space read-currently-open-file-args "</function>" space
read-file-args ::= "{" space read-file-args-filepath-kv "}" space
read-file-args-filepath-kv ::= "\"filepath\"" space ":" space string
read-file-call ::= "{" space read-file-call-name-kv "," space read-file-call-arguments-kv "}" space
read-file-call-arguments ::= "{" space read-file-call-arguments-filepath-kv "}" space
read-file-call-arguments-filepath-kv ::= "\"filepath\"" space ":" space string
read-file-call-arguments-kv ::= "\"arguments\"" space ":" space read-file-call-arguments
read-file-call-name ::= "\"read_file\"" space
read-file-call-name-kv ::= "\"name\"" space ":" space read-file-call-name
read-file-function-tag ::= "<function" ( "=read_file" | " name=\"read_file\"" ) ">" space read-file-args "</function>" space
request-rule-args ::= "{" space request-rule-args-name-kv "}" space
request-rule-args-name-kv ::= "\"name\"" space ":" space string
request-rule-call ::= "{" space request-rule-call-name-kv "," space request-rule-call-arguments-kv "}" space
request-rule-call-arguments ::= "{" space request-rule-call-arguments-name-kv "}" space
request-rule-call-arguments-kv ::= "\"arguments\"" space ":" space request-rule-call-arguments
request-rule-call-arguments-name-kv ::= "\"name\"" space ":" space string
request-rule-call-name ::= "\"request_rule\"" space
request-rule-call-name-kv ::= "\"name\"" space ":" space request-rule-call-name
request-rule-function-tag ::= "<function" ( "=request_rule" | " name=\"request_rule\"" ) ">" space request-rule-args "</function>" space
root ::= tool-call
run-terminal-command-args ::= "{" space run-terminal-command-args-command-kv ( "," space ( run-terminal-command-args-waitForCompletion-kv ) )? "}" space
run-terminal-command-args-command-kv ::= "\"command\"" space ":" space string
run-terminal-command-args-waitForCompletion-kv ::= "\"waitForCompletion\"" space ":" space boolean
run-terminal-command-call ::= "{" space run-terminal-command-call-name-kv "," space run-terminal-command-call-arguments-kv "}" space
run-terminal-command-call-arguments ::= "{" space run-terminal-command-call-arguments-command-kv ( "," space ( run-terminal-command-call-arguments-waitForCompletion-kv ) )? "}" space
run-terminal-command-call-arguments-command-kv ::= "\"command\"" space ":" space string
run-terminal-command-call-arguments-kv ::= "\"arguments\"" space ":" space run-terminal-command-call-arguments
run-terminal-command-call-arguments-waitForCompletion-kv ::= "\"waitForCompletion\"" space ":" space boolean
run-terminal-command-call-name ::= "\"run_terminal_command\"" space
run-terminal-command-call-name-kv ::= "\"name\"" space ":" space run-terminal-command-call-name
run-terminal-command-function-tag ::= "<function" ( "=run_terminal_command" | " name=\"run_terminal_command\"" ) ">" space run-terminal-command-args "</function>" space
search-web-args ::= "{" space search-web-args-query-kv "}" space
search-web-args-query-kv ::= "\"query\"" space ":" space string
search-web-call ::= "{" space search-web-call-name-kv "," space search-web-call-arguments-kv "}" space
search-web-call-arguments ::= "{" space search-web-call-arguments-query-kv "}" space
search-web-call-arguments-kv ::= "\"arguments\"" space ":" space search-web-call-arguments
search-web-call-arguments-query-kv ::= "\"query\"" space ":" space string
search-web-call-name ::= "\"search_web\"" space
search-web-call-name-kv ::= "\"name\"" space ":" space search-web-call-name
search-web-function-tag ::= "<function" ( "=search_web" | " name=\"search_web\"" ) ">" space search-web-args "</function>" space
space ::= | " " | "\n"{1,2} [ \t]{0,20}
string ::= "\"" char* "\"" space
tool-call ::= read-file-function-tag | create-new-file-function-tag | run-terminal-command-function-tag | file-glob-search-function-tag | search-web-function-tag | view-diff-function-tag | read-currently-open-file-function-tag | ls-function-tag | create-rule-block-function-tag | fetch-url-content-function-tag | grep-search-function-tag | request-rule-function-tag | edit-existing-file-function-tag | wrappable-tool-call | ( "```\n" | "```json\n" | "```xml\n" ) space wrappable-tool-call space "```" space
view-diff-args ::= "{" space "}" space
view-diff-call ::= "{" space view-diff-call-name-kv "," space view-diff-call-arguments-kv "}" space
view-diff-call-arguments ::= "{" space "}" space
view-diff-call-arguments-kv ::= "\"arguments\"" space ":" space view-diff-call-arguments
view-diff-call-name ::= "\"view_diff\"" space
view-diff-call-name-kv ::= "\"name\"" space ":" space view-diff-call-name
view-diff-function-tag ::= "<function" ( "=view_diff" | " name=\"view_diff\"" ) ">" space view-diff-args "</function>" space
wrappable-tool-call ::= ( any-tool-call | "<tool_call>" space any-tool-call "</tool_call>" | "<function_call>" space any-tool-call "</function_call>" | "<response>" space any-tool-call "</response>" | "<tools>" space any-tool-call "</tools>" | "<json>" space any-tool-call "</json>" | "<xml>" space any-tool-call "</xml>" | "<JSON>" space any-tool-call "</JSON>" ) space
srv params_from_: Grammar lazy: true
srv params_from_: Chat format: Hermes 2 Pro
srv params_from_: Preserved token: 151667
srv params_from_: Preserved token: 151668
srv params_from_: Preserved token: 151657
srv params_from_: Preserved token: 151658
srv params_from_: Not preserved because more than 1 token: <function
srv params_from_: Not preserved because more than 1 token: <tools>
srv params_from_: Not preserved because more than 1 token: </tools>
srv params_from_: Not preserved because more than 1 token: <response>
srv params_from_: Not preserved because more than 1 token: </response>
srv params_from_: Not preserved because more than 1 token: <function_call>
srv params_from_: Not preserved because more than 1 token: </function_call>
srv params_from_: Not preserved because more than 1 token: <json>
srv params_from_: Not preserved because more than 1 token: </json>
srv params_from_: Not preserved because more than 1 token: <JSON>
srv params_from_: Not preserved because more than 1 token: </JSON>
srv params_from_: Preserved token: 73594
srv params_from_: Not preserved because more than 1 token: ```json
srv params_from_: Not preserved because more than 1 token: ```xml
srv params_from_: Grammar trigger word: `<function=read_file>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"read_file"`
srv params_from_: Grammar trigger word: `<function=create_new_file>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"create_new_file"`
srv params_from_: Grammar trigger word: `<function=run_terminal_command>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"run_terminal_command"`
srv params_from_: Grammar trigger word: `<function=file_glob_search>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"file_glob_search"`
srv params_from_: Grammar trigger word: `<function=search_web>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"search_web"`
srv params_from_: Grammar trigger word: `<function=view_diff>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"view_diff"`
srv params_from_: Grammar trigger word: `<function=read_currently_open_file>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"read_currently_open_file"`
srv params_from_: Grammar trigger word: `<function=ls>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"ls"`
srv params_from_: Grammar trigger word: `<function=create_rule_block>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"create_rule_block"`
srv params_from_: Grammar trigger word: `<function=fetch_url_content>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"fetch_url_content"`
srv params_from_: Grammar trigger word: `<function=grep_search>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"grep_search"`
srv params_from_: Grammar trigger word: `<function=request_rule>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"request_rule"`
srv params_from_: Grammar trigger word: `<function=edit_existing_file>`
srv params_from_: Grammar trigger pattern: `<function\s+name\s*=\s*"edit_existing_file"`
srv params_from_: Grammar trigger pattern full: `(?:<think>[\s\S]*?</think>\s*)?(\s*(?:<tool_call>|<function|(?:```(?:json|xml)?
\s*)?(?:<function_call>|<tools>|<xml><json>|<response>)?\s*\{\s*"name"\s*:\s*"(?:read_file|create_new_file|run_terminal_command|file_glob_search|search_web|view_diff|read_currently_open_file|ls|create_rule_block|fetch_url_content|grep_search|request_rule|edit_existing_file)"))[\s\S]*`
srv add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 0
slot get_availabl: id 0 | task -1 | selected slot by lru, t_last = -1
slot reset: id 0 | task -1 |
slot launch_slot_: id 0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":50016,"speculative":false,"is_processing":false,"params":{"n_predict":4096,"seed":4294967295,"temperature":0.699999988079071,...
slot launch_slot_: id 0 | task 0 | processing task
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 50016, n_keep = 0, n_prompt_tokens = 2340
slot update_slots: id 0 | task 0 | trying to reuse chunks with size > 256, slot.n_past = 0
slot update_slots: id 0 | task 0 | after context reuse, new slot.n_past = 0
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.875214
srv update_slots: decoding batch, n_tokens = 2048
clear_adapter_lora: call
set_embeddings: value = 0
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 1
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 2, front = 0
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2340, n_tokens = 292, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 2340, n_tokens = 292
srv update_slots: decoding batch, n_tokens = 292
clear_adapter_lora: call
set_embeddings: value = 0
Grammar still awaiting trigger after token 40 (`I`)
srv update_chat_: Parsing chat message: I
Parsing input with format Hermes 2 Pro: I
Parsed message: {"role":"assistant","content":"I"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 1, n_remaining = 4095, next token: 40 'I'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 2
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 3, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2341, n_cache_tokens = 2341, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"I"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 3278 (`'ll`)
srv update_chat_: Parsing chat message: I'll
Parsing input with format Hermes 2 Pro: I'll
Parsed message: {"role":"assistant","content":"I'll"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 2, n_remaining = 4094, next token: 3278 ''ll'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 3
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 4, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2342, n_cache_tokens = 2342, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"'ll"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 2711 (` search`)
srv update_chat_: Parsing chat message: I'll search
Parsing input with format Hermes 2 Pro: I'll search
Parsed message: {"role":"assistant","content":"I'll search"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 3, n_remaining = 4093, next token: 2711 ' search'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 4
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 5, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2343, n_cache_tokens = 2343, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" search"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 369 (` for`)
srv update_chat_: Parsing chat message: I'll search for
Parsing input with format Hermes 2 Pro: I'll search for
Parsed message: {"role":"assistant","content":"I'll search for"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 4, n_remaining = 4092, next token: 369 ' for'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 5
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 6, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2344, n_cache_tokens = 2344, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" for"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 1045 (` some`)
srv update_chat_: Parsing chat message: I'll search for some
Parsing input with format Hermes 2 Pro: I'll search for some
Parsed message: {"role":"assistant","content":"I'll search for some"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 5, n_remaining = 4091, next token: 1045 ' some'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 6
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 7, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2345, n_cache_tokens = 2345, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" some"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 7040 (` interesting`)
srv update_chat_: Parsing chat message: I'll search for some interesting
Parsing input with format Hermes 2 Pro: I'll search for some interesting
Parsed message: {"role":"assistant","content":"I'll search for some interesting"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 6, n_remaining = 4090, next token: 7040 ' interesting'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 7
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 8, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2346, n_cache_tokens = 2346, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" interesting"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 323 (` and`)
srv update_chat_: Parsing chat message: I'll search for some interesting and
Parsing input with format Hermes 2 Pro: I'll search for some interesting and
Parsed message: {"role":"assistant","content":"I'll search for some interesting and"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 7, n_remaining = 4089, next token: 323 ' and'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 8
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 9, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2347, n_cache_tokens = 2347, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" and"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 2464 (` fun`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 8, n_remaining = 4088, next token: 2464 ' fun'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 9
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 10, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2348, n_cache_tokens = 2348, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" fun"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 13064 (` facts`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 9, n_remaining = 4087, next token: 13064 ' facts'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 10
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 11, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2349, n_cache_tokens = 2349, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" facts"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 311 (` to`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 10, n_remaining = 4086, next token: 311 ' to'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 11
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 12, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2350, n_cache_tokens = 2350, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" to"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 4332 (` share`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 11, n_remaining = 4085, next token: 4332 ' share'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 12
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 13, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2351, n_cache_tokens = 2351, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" share"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 448 (` with`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 12, n_remaining = 4084, next token: 448 ' with'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 13
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 14, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2352, n_cache_tokens = 2352, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" with"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 498 (` you`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 13, n_remaining = 4083, next token: 498 ' you'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 14
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 15, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2353, n_cache_tokens = 2353, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 624 (`.
`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
Partial parse: (?:(```(?:xml|json)?\n\s*)?(<tool_call>|<function_call>|<tool>|<tools>|<response>|<json>|<xml>|<JSON>)?(\s*\{\s*"name"))|<function=([^>]+)>|<function name="([^"]+)">
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you."}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 14, n_remaining = 4082, next token: 624 '.
'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 15
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 16, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2354, n_cache_tokens = 2354, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"."}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 151657 (`<tool_call>`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
Partial parse: (?:(```(?:xml|json)?\n\s*)?(<tool_call>|<function_call>|<tool>|<tools>|<response>|<json>|<xml>|<JSON>)?(\s*\{\s*"name"))|<function=([^>]+)>|<function name="([^"]+)">
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 15, n_remaining = 4081, next token: 151657 '<tool_call>'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 16
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 17, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2355, n_cache_tokens = 2355, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 198 (`
`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
Partial parse: (?:(```(?:xml|json)?\n\s*)?(<tool_call>|<function_call>|<tool>|<tools>|<response>|<json>|<xml>|<JSON>)?(\s*\{\s*"name"))|<function=([^>]+)>|<function name="([^"]+)">
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 16, n_remaining = 4080, next token: 198 '
'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 17
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 18, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2356, n_cache_tokens = 2356, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
Grammar still awaiting trigger after token 27 (`<`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
<
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
<
Partial parse: (?:(```(?:xml|json)?\n\s*)?(<tool_call>|<function_call>|<tool>|<tools>|<response>|<json>|<xml>|<JSON>)?(\s*\{\s*"name"))|<function=([^>]+)>|<function name="([^"]+)">
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n<tool_call>\n"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 17, n_remaining = 4079, next token: 27 '<'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 18
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 19, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2357, n_cache_tokens = 2357, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<tool_call>\n"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 1688 (`function`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function
Partial parse: (?:(```(?:xml|json)?\n\s*)?(<tool_call>|<function_call>|<tool>|<tools>|<response>|<json>|<xml>|<JSON>)?(\s*\{\s*"name"))|<function=([^>]+)>|<function name="([^"]+)">
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n<tool_call>\n"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 18, n_remaining = 4078, next token: 1688 'function'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 19
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 20, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2358, n_cache_tokens = 2358, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
Grammar still awaiting trigger after token 96598 (`=search`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n<tool_call>\n<function=search"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 19, n_remaining = 4077, next token: 96598 '=search'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 20
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 21, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2359, n_cache_tokens = 2359, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<function=search"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar still awaiting trigger after token 25960 (`_web`)
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search_web
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search_web
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n<tool_call>\n<function=search_web"}
srv send: sending result for task id = 0
srv send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 20, n_remaining = 4076, next token: 25960 '_web'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 21
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 22, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 50016, n_past = 2360, n_cache_tokens = 2360, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"_web"}}],"created":1753979799,"id":"chatcmpl-syQmS9ST4eaCJelPs8FTaPgktIlDyFdF","model":"default","system_fingerprint":"b5985-3f4fc97f","object":"chat.completion.chunk"}
Grammar triggered on regex: '<function=search_web>
'
srv update_chat_: Parsing chat message: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search_web>
Parsing input with format Hermes 2 Pro: I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search_web>
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 2, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal: <<<
>>>
Parsed message: {"role":"assistant","content":"I'll search for some interesting and fun facts to share with you.\n<tool_call>\n\n"}
/lib64/libggml-base.so(+0x2e65) [0x7f642d3f0e65]
/lib64/libggml-base.so(ggml_print_backtrace+0x1ec) [0x7f642d3f122c]
/lib64/libggml-base.so(+0x13119) [0x7f642d401119]
/lib64/libstdc++.so.6(+0x1eadc) [0x7f642d195adc]
/lib64/libstdc++.so.6(_ZSt10unexpectedv+0x0) [0x7f642d17fd3c]
/lib64/libstdc++.so.6(+0x1ed88) [0x7f642d195d88]
llama-server() [0x417101]
llama-server() [0x524bd3]
llama-server() [0x48ef1b]
llama-server() [0x48f6ba]
llama-server() [0x48fe6c]
llama-server() [0x49fbfd]
llama-server() [0x46fb89]
llama-server() [0x431f0c]
/lib64/libc.so.6(+0x35f5) [0x7f642ce6e5f5]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f642ce6e6a8]
llama-server() [0x433be5]
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: 'I'll search for some interesting and fun facts to share with you.
<tool_call>
<function=search_web' not found at start of 'I'll search for some interesting and fun facts to share with you.
<tool_call>
Same error
Value is not callable: null at row 62, column 114:
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
^
{%- if param_fields[json_key] is mapping %}
at row 62, column 21:
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
^
{%- if param_fields[json_key] is mapping %}
at row 61, column 55:
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
^
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
at row 61, column 17:
{%- for json_key in param_fields %}
{%- if json_key not in handled_keys %}
^
{%- set normed_json_key = json_key | replace("-", "_") | replace(" ", "_") | replace("$", "") %}
at row 60, column 48:
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
^
{%- if json_key not in handled_keys %}
at row 60, column 13:
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key in param_fields %}
^
{%- if json_key not in handled_keys %}
at row 49, column 80:
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
^
{{- '\n<parameter>' }}
at row 49, column 9:
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
^
{{- '\n<parameter>' }}
at row 42, column 29:
{{- "<tools>" }}
{%- for tool in tools %}
^
{%- if tool.function is defined %}
at row 42, column 5:
{{- "<tools>" }}
{%- for tool in tools %}
^
{%- if tool.function is defined %}
at row 39, column 51:
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
^
{{- "\n\nYou have access to the following functions:\n\n" }}
at row 39, column 1:
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
^
{{- "\n\nYou have access to the following functions:\n\n" }}
at row 1, column 69:
{#- Copyright 2025-present the Unsloth team. All rights reserved. #}
^
{#- Licensed under the Apache License, Version 2.0 (the "License") #}
Can confirm that the llama.cpp server crashes when parsing tool calls.
Most relevant llama.cpp server logs:
[...]
Parsing input with format Hermes 2 Pro: I'll help you implement support for the internal RTC in your STM32CubeIDE project. Let me first explore the project structure to understand how the current RTC implementation works and then implement the requested changes.
First, let me check what files exist related to RTC:
<tool_call>
<function=bash>
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 2, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal: <<<
>>>
[...]
#1 0x00007e4e0e46ede3 in ggml_print_backtrace () from /home/xxx/llama.cpp/build/bin/libggml-base.so
#2 0x00007e4e0e47f83f in ggml_uncaught_exception() () from /home/xxx/llama.cpp/build/bin/libggml-base.so
#3 0x00007e4e0debb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007e4e0dea5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007e4e0debb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00005e12c30c9f2e in string_diff(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold] ()
#7 0x00005e12c31e0b6c in common_chat_msg_diff::compute_diffs(common_chat_msg const&, common_chat_msg const&) ()
#8 0x00005e12c313e6eb in server_slot::update_chat_msg(std::vector<common_chat_msg_diff, std::allocator<common_chat_msg_diff> >&) ()
#9 0x00005e12c313ee1f in server_context::send_partial_response(server_slot&, completion_token_output const&) ()
#10 0x00005e12c313f5c5 in server_context::process_token(completion_token_output&, server_slot&) ()
#11 0x00005e12c315978c in server_context::update_slots() ()
#12 0x00005e12c311dbc5 in server_queue::start_loop() ()
#13 0x00005e12c30e4f5e in main ()
Running llama.cpp server inference with streaming disabled avoids server crashes in my case; however, there are some artifacts between tool calls.
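For reference, here is a minimal sketch of that workaround (base URL, model alias, and the tool schema are placeholders, not my exact setup): the same kind of tool-call request sent with stream=False through the OpenAI-compatible endpoint, so the server returns the whole message at once instead of going through the streaming partial-response path seen in the backtrace above.
from openai import OpenAI

client = OpenAI(base_url = "http://127.0.0.1:8001/v1", api_key = "sk-no-key-required")

# Placeholder tool schema; whatever tools the agent normally sends would go here.
tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for information.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."},
            },
            "required": ["query"],
        },
    },
}]

completion = client.chat.completions.create(
    model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
    messages = [{"role": "user", "content": "Find some fun facts to share."}],
    tools = tools,
    stream = False,  # streaming disabled: avoids the crash in the streaming path for me
)
print(completion.choices[0].message.tool_calls)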
I think the problem is that llama.cpp recognizes the tool call format as a Hermes 2 Pro template:
// common/chat.cpp
// [...]
// Hermes 2/3 Pro, Qwen 2.5 Instruct (w/ tools)
if (src.find("<tool_call>") != std::string::npos && params.json_schema.is_null()) {
return common_chat_params_init_hermes_2_pro(tmpl, params);
}
// [...]
At a quick glance, "Hermes 2 Pro" format looks different from the built-in Jinja template.
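For comparison, a minimal sketch (assuming the transformers library and access to the Hugging Face Hub; the repo id is the upstream Qwen base model) that just prints the chat template the original model ships with, so it can be eyeballed against what llama.cpp's Hermes 2 Pro handling expects:
# Print the upstream chat template; purely for manual inspection.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")
print(tok.chat_template)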
If this turns out to be an issue, can we just download an updated jinja template later, vs the whole model?
Just wanted to confirm that I'm also having a really bad experience with this. I actually created a different thread here more specific to RooCode. But yeah there is 100% something wrong, especially considering that I have had an overwhelmingly better experience with the two other models released earlier this week (it genuinely feels impossible for the tool calling to have regressed this badly).
I'm hoping that this is a template issue or something like that (I'm not really knowledgeable about the templates and such, so I defer to all of your wisdom on that). Really hopeful that this can get sorted!
Also seeing this. Thought I must be doing something wrong as it's my first time messing around with local models, but apparently not! Trying to use Q3_K_XL via ramalama serve and connect to it with goose; it serves fine, but any request hits a 500 with this error. I also see the chat format shown as Hermes 2 Pro, even though I had ramalama run llama-server with --jinja.
Looks like there's an actual issue and Unsloth folks are looking at it: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/discussions/4
(Figured I'd post that here so the team can focus on working on the fix instead of responding to more threads!)
Sorry guys - we're working on a fix. Can someone try the below to see if it works as expected? I confirmed the below works as intended for now:
(Use tmux to load the llama.cpp server on one side, then CTRL+B+D to detach. To get it back: tmux attach-session -t0)
./llama.cpp/llama-server \
--model unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct" \
--jinja \
--device CUDA0 \
--log-verbosity 99 \
--port 8001
and with Python code calling the model:
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://127.0.0.1:8001/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
def get_current_temperature(location: str, unit: str = "celsius"):
"""Get current temperature at a location.
Args:
location: The location to get the temperature for, in the format "City, State, Country".
unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
Returns:
the temperature, the location, and the unit in a dict
"""
return {
"temperature": 26.1,
"location": location,
"unit": unit,
}
def get_temperature_date(location: str, date: str, unit: str = "celsius"):
"""Get temperature at a location and date.
Args:
location: The location to get the temperature for, in the format "City, State, Country".
date: The date to get the temperature for, in the format "Year-Month-Day".
unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
Returns:
the temperature, the location, the date and the unit in a dict
"""
return {
"temperature": 25.9,
"location": location,
"date": date,
"unit": unit,
}
def get_function_by_name(name):
if name == "get_current_temperature":
return get_current_temperature
elif name == "get_temperature_date":
return get_temperature_date
else:
raise RuntimeError(f"No function named {name}")
weather_tool_calls = [
{
"type": "function",
"function": {
"name": "get_current_temperature",
"description": "Get current temperature at a location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The location to get the temperature for, in the format "City, State, Country".',
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
},
},
"required": ["location"],
},
},
},
{
"type": "function",
"function": {
"name": "get_temperature_date",
"description": "Get temperature at a location and date.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The location to get the temperature for, in the format "City, State, Country".',
},
"date": {
"type": "string",
"description": 'The date to get the temperature for, in the format "Year-Month-Day".',
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
},
},
"required": ["location", "date"],
},
},
},
]
messages = [
{"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow? Today's date is 2024-09-30."},
]
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
messages = messages,
tools = weather_tool_calls,
)
print(completion.choices[0].message.tool_calls)
message = completion.choices[0].message
messages.append(message)
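# Keep looping while the assistant requests tool calls; this minimal example only executes the first tool call of each turn.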
while len((tool_call := message.tool_calls) or []) != 0:
tool_call = tool_call[0]
function_name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
result = get_function_by_name(function_name)(**arguments)
print(function_name, result)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": str(result),
})
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
messages = messages,
tools = weather_tool_calls,
)
message = completion.choices[0].message
print(message.content)
you should get:
2 + 2 = 4
[ChatCompletionMessageToolCall(id='x1NNOuJijQ3sYh3Gmy9voH3KqGmOAcxs', function=Function(arguments='{"location":"San Francisco, California, USA","unit":"celsius"}', name='get_current_temperature'), type='function')]
get_current_temperature {'temperature': 26.1, 'location': 'San Francisco, California, USA', 'unit': 'celsius'}
get_temperature_date {'temperature': 25.9, 'location': 'San Francisco, California, USA', 'date': '2024-10-01', 'unit': 'celsius'}
The current temperature in San Francisco is 26.1Β°C.
For tomorrow, October 1, 2024, the temperature is expected to be 25.9Β°C.
As an update: good news, I think I fixed it!!
The culprit seems to be that other systems (Roo Code etc.) don't stick to the recommended ["type", "enum", "description"] keys and use extra ones as well, so the below would fail:
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
"********1" : "********************1",
"********2" : ["********************2"],
"********3" : {"********************3" : "**", "********************3" : "**"},
"********4" : None,
},
I have now updated the template, and can verify it works on my side - please verify and see if the new chat template works!
You do NOT need to download the model again. Instead, download the new template (via hf download or wget):
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct chat_template.jinja --local-dir unsloth
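Optionally, before restarting the server, you can sanity-check that the downloaded template now renders a schema containing extra keys. This is only a rough sketch assuming the transformers library: it renders with Python's Jinja2 rather than llama.cpp's built-in engine, and the "x-extra" key is a made-up stand-in for the non-standard keys mentioned above.
# Smoke test: render the downloaded template with a tool parameter that carries a
# non-standard key. If this prints a prompt without raising, the template handles
# extra keys; it is not a guarantee that llama.cpp's engine behaves identically.
from transformers import AutoTokenizer

template = open("unsloth/chat_template.jinja").read()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query.",
                    "x-extra": ["client", "specific", "values"],  # non-standard key (invented)
                },
            },
            "required": ["query"],
        },
    },
}]

print(tok.apply_chat_template(
    [{"role": "user", "content": "Find a fun fact."}],
    tools = tools,
    chat_template = template,
    tokenize = False,
    add_generation_prompt = True,
))
Then restart llama-server pointing --chat-template-file at the downloaded file: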
./llama.cpp/llama-server \
--model unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct" \
--jinja \
--threads -1 \
--n-gpu-layers 999 \
--device CUDA0 \
--min_p 0.00 \
--log-verbosity 99 \
--port 8001 \
--chat-template-file unsloth/chat_template.jinja
then test it:
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://127.0.0.1:8001/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
def get_current_temperature(location: str, unit: str = "celsius"):
"""Get current temperature at a location.
Args:
location: The location to get the temperature for, in the format "City, State, Country".
unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
Returns:
the temperature, the location, and the unit in a dict
"""
return {
"temperature": 26.1,
"location": location,
"unit": unit,
}
def get_temperature_date(location: str, date: str, unit: str = "celsius"):
"""Get temperature at a location and date.
Args:
location: The location to get the temperature for, in the format "City, State, Country".
date: The date to get the temperature for, in the format "Year-Month-Day".
unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
Returns:
the temperature, the location, the date and the unit in a dict
"""
return {
"temperature": 25.9,
"location": location,
"date": date,
"unit": unit,
}
def get_function_by_name(name):
if name == "get_current_temperature":
return get_current_temperature
elif name == "get_temperature_date":
return get_temperature_date
else:
raise RuntimeError(f"No function named {name}")
weather_tool_calls = [
{
"type": "function",
"function": {
"name": "get_current_temperature",
"description": "Get current temperature at a location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The location to get the temperature for, in the format "City, State, Country".',
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
"********1" : "********************1",
"********2" : ["********************2"],
"********3" : {"********************3" : "**", "********************3" : "**"},
"********4" : None,
},
},
"required": ["location"],
},
},
},
{
"type": "function",
"function": {
"name": "get_temperature_date",
"description": "Get temperature at a location and date.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The location to get the temperature for, in the format "City, State, Country".',
},
"date": {
"type": "string",
"description": 'The date to get the temperature for, in the format "Year-Month-Day".',
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
},
},
"required": ["location", "date"],
},
},
},
]
messages = [
{"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow? Today's date is 2024-09-30."},
]
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
messages = messages,
tools = weather_tool_calls,
)
print(completion.choices[0].message.tool_calls)
message = completion.choices[0].message
messages.append(message)
while len((tool_call := message.tool_calls) or []) != 0:
tool_call = tool_call[0]
function_name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
result = get_function_by_name(function_name)(**arguments)
print(function_name, result)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": str(result),
})
completion = openai_client.chat.completions.create(
model = "unsloth/Qwen3-Coder-30B-A3B-Instruct",
messages = messages,
tools = weather_tool_calls,
)
message = completion.choices[0].message
print(message.content)
you should get:
2 + 2 = 4
[ChatCompletionMessageToolCall(id='7AUqc1Qm1qFHYNddU3PBhmEkmoQj2HE1', function=Function(arguments='{"location":"San Francisco, California, USA"}', name='get_current_temperature'), type='function')]
get_current_temperature {'temperature': 26.1, 'location': 'San Francisco, California, USA', 'unit': 'celsius'}
get_temperature_date {'temperature': 25.9, 'location': 'San Francisco, California, USA', 'date': '2024-10-01', 'unit': 'celsius'}
The current temperature in San Francisco is 26.1Β°C. Tomorrow's temperature is expected to be 25.9Β°C.
Sorry on the issue, and hopefully this will fix it! Please test it and get back to me - I'll try my best asap to fix it if more issues arise!
@austinsr @qingy2024 @KushGupster @suneetk @redeemer @sbeltz @JamesMowery @adamwillrh @ijohn07
Some good news: Charm's Crush is now working! So this definitely did something. (I haven't actually tested it for any coding yet; I just said "Hello" and it has at least loaded the model. I'm not really familiar with Claude Code / Gemini CLI / other types of workflows.)
Definitely having some slightly better interaction with RooCode. It's using the tools and it's generating the task list.
However, it is still getting stuck in an infinite loop.
Expanded the last API call if helpful:
After about two or three minutes it spat this out (which, sadly, is behavior I was seeing way earlier today when I started complaining on Reddit, and which I don't see when using the Non Thinking model):
About 5 minutes later still infinite...
Another 3 minutes later it does something but still sort of in this infinite loop...
So yeah, it's getting better, but with RooCode there are still definitely some issues that need to be worked on.
I tried Charm Crush now that the model actually loaded.
It's also very broken. :(
main: server is listening on http://127.0.0.1:5825 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
[INFO] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> Health check passed on http://localhost:5825/health
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> swapState() State transitioned from starting to ready
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, n_prompt_tokens = 10809
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.189472
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.378943
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.568415
slot update_slots: id 0 | task 0 | kv cache rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.757887
slot update_slots: id 0 | task 0 | kv cache rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 10240, n_tokens = 2048, progress = 0.947359
slot update_slots: id 0 | task 0 | kv cache rm [10240, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 10809, n_tokens = 569, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 10809, n_tokens = 569
[New LWP 1020434]
[New LWP 1020433]
[New LWP 1020432]
[New LWP 1020431]
[New LWP 1020430]
[New LWP 1020429]
[New LWP 1020428]
[New LWP 1020427]
[New LWP 1020426]
[New LWP 1020425]
[New LWP 1020424]
[New LWP 1020423]
[New LWP 1020422]
[New LWP 1020421]
[New LWP 1020420]
[New LWP 1020416]
[New LWP 1020415]
[New LWP 1020414]
[New LWP 1020413]
[New LWP 1020412]
[New LWP 1020411]
[New LWP 1020410]
[New LWP 1020409]
[New LWP 1020408]
[New LWP 1020407]
[New LWP 1020406]
[New LWP 1020405]
[New LWP 1020404]
[New LWP 1020403]
[New LWP 1020402]
[New LWP 1020401]
[New LWP 1020400]
[New LWP 1020399]
[New LWP 1020398]
[New LWP 1020397]
[New LWP 1020396]
[New LWP 1020395]
[New LWP 1020394]
[New LWP 1020393]
[New LWP 1020392]
[New LWP 1020391]
[New LWP 1020390]
[New LWP 1020389]
[New LWP 1020388]
[New LWP 1020387]
[New LWP 1020386]
[New LWP 1020385]
[New LWP 1020384]
[New LWP 1020383]
[New LWP 1020382]
[New LWP 1020381]
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007eff774b2742 in ?? () from /usr/lib/libc.so.6
#0 0x00007eff774b2742 in ?? () from /usr/lib/libc.so.6
#1 0x00007eff774a5eec in ?? () from /usr/lib/libc.so.6
#2 0x00007eff7752825b in wait4 () from /usr/lib/libc.so.6
#3 0x00007eff77afcd3f in ggml_print_backtrace () from /usr/lib/libggml-base.so
#4 0x00007eff77b126c0 in ?? () from /usr/lib/libggml-base.so
#5 0x00007eff778b3c6c in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
warning: 48 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc: No such file or directory
#6 0x00007eff77894644 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:102
102 in /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc
#7 0x00007eff778b3f78 in __cxxabiv1::__cxa_throw (obj=0x556e36e007f0, tinfo=0x7eff77ad65e0 <typeinfo for std::runtime_error>, dest=0x7eff778d67b0 <std::runtime_error::~runtime_error()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
warning: 98 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory
#8 0x0000556e032e9bcf in ?? ()
#9 0x0000556e03345aae in ?? ()
#10 0x0000556e0334fd22 in ?? ()
#11 0x0000556e03356ed4 in ?? ()
#12 0x0000556e03340899 in ?? ()
#13 0x0000556e033183fe in ?? ()
#14 0x00007eff77427b8b in ?? () from /usr/lib/libc.so.6
#15 0x00007eff77427c3b in __libc_start_main () from /usr/lib/libc.so.6
#16 0x0000556e0331c195 in ?? ()
[Inferior 1 (process 1020380) detached]
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: 'I'll create a Rampart game in the rampart.py file using Pygame. Let me first check what's already in that file to understand the current state.
<tool_call>
<function=ls' not found at start of 'I'll create a Rampart game in the rampart.py file using Pygame. Let me first check what's already in that file to understand the current state.
<tool_call>
'
[INFO] Request ::1 "POST /v1/chat/completions HTTP/1.1" 200 10352 "OpenAI/Go 1.11.1" 43.558199136s
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> cmd.Wait() returned error: signal: aborted (core dumped)
[WARN] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> ExitError >> signal: aborted (core dumped), exit code: -1
[INFO] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> process exited but not StateStopping, current state: ready
Crashed out here.
Non Think is actually working though:
main: server is listening on http://127.0.0.1:5816 - starting the main loop
srv update_slots: all slots are idle
[INFO] <Qwen3-30B-A3B-Instruct-2507-UD-Q6KXL> Health check passed on http://localhost:5816/health
srv log_server_r: request: GET /health 127.0.0.1 200
[DEBUG] <Qwen3-30B-A3B-Instruct-2507-UD-Q6KXL> swapState() State transitioned from starting to ready
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 10778
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.190017
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.380033
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.570050
slot update_slots: id 0 | task 0 | kv cache rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.760067
slot update_slots: id 0 | task 0 | kv cache rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 10240, n_tokens = 2048, progress = 0.950083
slot update_slots: id 0 | task 0 | kv cache rm [10240, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 10778, n_tokens = 538, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 10778, n_tokens = 538
slot release: id 0 | task 0 | stop processing: n_past = 10808, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 16153.22 ms / 10778 tokens ( 1.50 ms per token, 667.24 tokens per second)
eval time = 1853.09 ms / 31 tokens ( 59.78 ms per token, 16.73 tokens per second)
total time = 18006.31 ms / 10809 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[DEBUG] <Qwen3-30B-A3B-Instruct-2507-UD-Q6KXL> request /v1/chat/completions - start: 10.254308556s, total: 28.284101405s
[INFO] Request ::1 "POST /v1/chat/completions HTTP/1.1" 200 6885 "OpenAI/Go 1.11.1" 34.847312848s
[DEBUG] Exclusive mode for group (default), stopping other process groups
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 37 | processing task
slot update_slots: id 0 | task 37 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 10826
slot update_slots: id 0 | task 37 | kv cache rm [10792, end)
slot update_slots: id 0 | task 37 | prompt processing progress, n_past = 10826, n_tokens = 34, progress = 0.003141
slot update_slots: id 0 | task 37 | prompt done, n_past = 10826, n_tokens = 34
Non Think is actually working on the file and calling the tools.
From the other thread: a user is reporting that they also had the same issues with the local model, and that even after switching to the official API from Alibaba the Coder model is failing in RooCode:
Fwiw, I had the same issues with different quants, swapped over to the official api from alibaba modelstudio and the model was still erroring out in roocode.
Maybe it makes sense to get the Alibaba / Qwen teams involved, if this is actually an upstream issue? I'm sure they don't want to close their amazing week of releases with a downer.
@JamesMowery Wait which of the following statements is True:
1. unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF, NO thinking, with new chat template WORKS
2. unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF, YES thinking, with new chat template FAILS
3. unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF FAILS
4. unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF WORKS
5. unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF FAILS
6. unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF WORKS
If 2. is the case, how are you enabling thinking for unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF? These models don't have reasoning capabilities!
If you are using unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF, that might be another issue.
Sorry, there are a lot of models so I'm a bit confused!
I'm comparing the Qwen 3 Coder model (Qwen3-Coder-30B-A3B-Instruct-GGUF) to the Qwen 3 Non Think model (Qwen3-30B-A3B-Instruct-2507-GGUF) released this past Monday. The Non Think model is working perfectly in tool calling for me for everything I throw at it (including RooCode and Crush, and also some agent tests I've been running with Pydantic AI).
I have no idea how think vs non think works. I promise. I'm just using the defaults in RooCode.
Settings in RooCode:
(Only thing I tweaked was context size to match what I put in llama.cpp.)
Also, I don't think unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF uses the srv params_from_: Chat format: Hermes 2 Pro chat template - is there a way to specify using --jinja directly in Charm Crush?
@JamesMowery
Are you using an updated chat template, or does RooCode not allow customized templates? Rephrase - are you serving this via llama-server? If so, could you provide the logs for llama-server?
I have this set in my llama-swap > llama-server config:
"Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 29
--ctx-size 262144
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/chat_template.jinja
--jinja
ttl: 120
"Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q5_K_XL.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 30
--ctx-size 196608
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/chat_template.jinja
--jinja
ttl: 120
"Qwen3-Coder-30B-A3B-Instruct-UD-Q6KXL":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf
--alias "Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 33
--ctx-size 65536
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/chat_template.jinja
--jinja
ttl: 120
I'm quite new to all this, so my deepest apologies if I'm doing something wrong.
Ok, that looks correct - my confusion is why Roo Code thinks we're using srv params_from_: Chat format: Hermes 2 Pro when it's a new custom template.
Just to be extra thorough, here's absolutely everything I'm getting after I start the server and have it hit from RooCode:
β― llama-swap -listen ":8081"
llama-swap listening on :8081
[DEBUG] Exclusive mode for group (default), stopping other process groups
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> swapState() State transitioned from stopped to starting
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> Executing start command: llama-server -m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf --alias unsloth/Qwen3-Coder-30B-A3B-Instruct --port 5825 --flash-attn --threads 16 --gpu-layers 29 --ctx-size 262144 --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --repeat-penalty 1.05 --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-file /mnt/big/AI/models/chat_template.jinja --jinja, env:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5994 (c7f3169cd) with gcc-13 (GCC) 13.3.1 20241125 for x86_64-pc-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 5825, http threads: 31
main: loading model
srv load_model: loading model '/mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 21415 MiB free
srv log_server_r: request: GET /health 127.0.0.1 503
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> Health check error on http://localhost:5825/health, status code: 503 (normal during startup)
llama_model_loader: loaded meta data with 42 key-value pairs and 579 tensors from /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 28: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,151387] = ["Δ Δ ", "Δ Δ Δ Δ ", "i n", "Δ t",...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 15
llama_model_loader: - kv 40: quantize.imatrix.file str = Qwen3-Coder-30B-A3B-Instruct-GGUF/ima...
llama_model_loader: - kv 41: quantize.imatrix.entries_count u32 = 383
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 292 tensors
llama_model_loader: - type q5_K: 35 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 16.45 GiB (4.63 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 29 repeating layers to GPU
load_tensors: offloaded 29/49 layers to GPU
load_tensors: CUDA0 model buffer size = 9980.55 MiB
load_tensors: CPU_Mapped model buffer size = 6860.74 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 262144
llama_context: n_ctx_per_seq = 262144
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 7888.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 5168.00 MiB
srv log_server_r: request: GET /health 127.0.0.1 503
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> Health check error on http://localhost:5825/health, status code: 503 (normal during startup)
llama_kv_cache_unified: size = 13056.00 MiB (262144 cells, 48 layers, 1/ 1 seqs), K (q8_0): 6528.00 MiB, V (q8_0): 6528.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: CUDA0 compute buffer size = 1068.50 MiB
llama_context: CUDA_Host compute buffer size = 516.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 270 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262144
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 262144
main: model loaded
main: chat template, chat_template: {#- Copyright 2025-present the Unsloth team. All rights reserved. #}
{#- Licensed under the Apache License, Version 2.0 (the "License") #}
{#- Edits made by Unsloth to fix the chat template #}
{% macro render_item_list(item_list, tag_name='required') %}
{%- if item_list is defined and item_list is iterable and item_list | length > 0 %}
{%- if tag_name %}{{- '\n<' ~ tag_name ~ '>' -}}{% endif %}
{{- '[' }}
{%- for item in item_list -%}
{%- if loop.index > 1 %}{{- ", "}}{% endif -%}
{%- if item is string -%}
{{ "`" ~ item ~ "`" }}
{%- else -%}
{{ item }}
{%- endif -%}
{%- endfor -%}
{{- ']' }}
{%- if tag_name %}{{- '</' ~ tag_name ~ '>' -}}{% endif %}
{%- endif %}
{% endmacro %}
{%- if messages[0]["role"] == "system" %}
{%- set system_message = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = [] %}
{%- endif %}
{%- if system_message is defined %}
{{- "<|im_start|>system\n" + system_message }}
{%- else %}
{%- if tools is iterable and tools | length > 0 %}
{{- "<|im_start|>system\nYou are Qwen, a helpful AI assistant that can interact with a computer to solve tasks." }}
{%- endif %}
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
{{- "\n\nYou have access to the following functions:\n\n" }}
{{- "<tools>" }}
{%- for tool in tools %}
{%- if tool.function is defined %}
{%- set tool = tool.function %}
{%- endif %}
{{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
{{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
{{- '\n<parameters>' }}
{%- for param_name, param_fields in tool.parameters.properties|items %}
{{- '\n<parameter>' }}
{{- '\n<name>' ~ param_name ~ '</name>' }}
{%- if param_fields.type is defined %}
{{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
{%- endif %}
{%- if param_fields.description is defined %}
{{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
{%- endif %}
{{- render_item_list(param_fields.enum, 'enum') }}
{%- set handled_keys = ['type', 'description', 'enum', 'required'] %}
{%- for json_key, json_value in param_fields|items %}
{%- if json_key not in handled_keys %}
{%- set normed_json_key = json_key|string %}
{%- if json_value is mapping %}
{{- '\n<' ~ normed_json_key ~ '>' ~ (json_value | tojson | safe) ~ '</' ~ normed_json_key ~ '>' }}
{%- else %}
{{- '\n<' ~ normed_json_key ~ '>' ~ (json_value | string) ~ '</' ~ normed_json_key ~ '>' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{{- render_item_list(param_fields.required, 'required') }}
{{- '\n</parameter>' }}
{%- endfor %}
{{- render_item_list(tool.parameters.required, 'required') }}
{{- '\n</parameters>' }}
{%- if tool.return is defined %}
{%- if tool.return is mapping %}
{{- '\n<return>' ~ (tool.return | tojson | safe) ~ '</return>' }}
{%- else %}
{{- '\n<return>' ~ (tool.return | string) ~ '</return>' }}
{%- endif %}
{%- endif %}
{{- '\n</function>' }}
{%- endfor %}
{{- "\n</tools>" }}
{{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
{%- endif %}
{%- if system_message is defined %}
{{- '<|im_end|>\n' }}
{%- else %}
{%- if tools is iterable and tools | length > 0 %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in loop_messages %}
{%- if message.role == "assistant" and message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
{{- '<|im_start|>' + message.role }}
{%- if message.content is defined and message.content is string and message.content | trim | length > 0 %}
{{- '\n' + message.content | trim + '\n' }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- if tool_call.arguments is defined %}
{%- for args_name, args_value in tool_call.arguments|items %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value if args_value is string else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
{{- '</function>\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "user" or message.role == "system" or message.role == "assistant" %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "tool" %}
{%- if loop.previtem and loop.previtem.role != "tool" %}
{{- '<|im_start|>user\n' }}
{%- endif %}
{{- '<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>\n' }}
{%- if not loop.last and loop.nextitem.role != "tool" %}
{{- '<|im_end|>\n' }}
{%- elif loop.last %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
{#- Copyright 2025-present the Unsloth team. All rights reserved. #}
{#- Licensed under the Apache License, Version 2.0 (the "License") #}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:5825 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
[INFO] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> Health check passed on http://localhost:5825/health
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> swapState() State transitioned from starting to ready
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, n_prompt_tokens = 9288
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.220500
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.440999
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.661499
slot update_slots: id 0 | task 0 | kv cache rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.881998
slot update_slots: id 0 | task 0 | kv cache rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 9288, n_tokens = 1096, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 9288, n_tokens = 1096
slot release: id 0 | task 0 | stop processing: n_past = 9426, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 8798.92 ms / 9288 tokens ( 0.95 ms per token, 1055.58 tokens per second)
eval time = 6339.65 ms / 139 tokens ( 45.61 ms per token, 21.93 tokens per second)
total time = 15138.57 ms / 9427 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> request /v1/chat/completions - start: 10.255748873s, total: 25.410812037s
[INFO] Request 127.0.0.1 "POST /v1/chat/completions HTTP/1.1" 200 38272 "RooCode/3.25.4" 25.410961396s
[DEBUG] Exclusive mode for group (default), stopping other process groups
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 144 | processing task
slot update_slots: id 0 | task 144 | new prompt, n_ctx_slot = 262144, n_keep = 0, n_prompt_tokens = 9684
slot update_slots: id 0 | task 144 | kv cache rm [9426, end)
slot update_slots: id 0 | task 144 | prompt processing progress, n_past = 9684, n_tokens = 258, progress = 0.026642
slot update_slots: id 0 | task 144 | prompt done, n_past = 9684, n_tokens = 258
^CReceived signal interrupt, shutting down...
[DEBUG] Shutdown() called in proxy manager
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> cmdStopUpstreamProcess() initiating graceful stop of upstream process
Received second interrupt, terminating immediately.
[INFO] Request 127.0.0.1 "POST /v1/chat/completions HTTP/1.1" 200 8718 "RooCode/3.25.4" 1.875323152s
[DEBUG] Exclusive mode for group (default), stopping other process groups
[INFO] Request 127.0.0.1 "POST /v1/chat/completions HTTP/1.1" 502 99 "RooCode/3.25.4" 471.109Β΅s
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> cmd.Wait() returned error: exit status 1
[WARN] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> ExitError >> exit status 1, exit code: 1
[INFO] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> process exited but not StateStopping, current state: ready
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL> stopCommand took 519.122412ms
(I manually killed the server at the end so you can disregard the last few bits.)
Is this what is supposed to happen?
(I'm heading off to bed but I'll respond first thing tomorrow if there's any further responses / requests. Happy to help best I can!)
Oh, actually I can confirm on my side as well that Hermes is used - I think this is just for parsing the JSON arguments, so that should be ok
@JamesMowery Oh no worries - thanks for debugging! So the main issue is the infinite repetitions, correct? I.e. the model keeps going in a loop on tool calling with no end in sight - is that right?
Ok that looks correct - my confusion is why Roo Code is thinking we're using
srv params_from_: Chat format: Hermes 2 Pro
when it's a new custom template
@danielhanchen
In the llama.cpp server (which I believe Roo Code is using because of the log message), llama.cpp just inspects the Jinja chat template source, and if it sees the <tool_call>
token, it assumes it's dealing with Hermes 2 Pro and then starts to parse the call internally (which is failing).
Here is the related llama.cpp code, where you can see this "matching" logic in action:
// common/chat.cpp
// ...
static common_chat_params common_chat_templates_apply_jinja(
const struct common_chat_templates * tmpls,
const struct common_chat_templates_inputs & inputs)
{
templates_params params;
params.tools = common_chat_tools_to_json_oaicompat<json>(inputs.tools);
const auto & tmpl = params.tools.is_array() && tmpls->template_tool_use
? *tmpls->template_tool_use
: *tmpls->template_default;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = common_chat_msgs_to_json_oaicompat<json>(inputs.messages, /* concat_text= */ !tmpl.original_caps().requires_typed_content);
params.add_generation_prompt = inputs.add_generation_prompt;
params.tool_choice = inputs.tool_choice;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.extra_context = json::object();
for (auto el : inputs.chat_template_kwargs) {
params.extra_context[el.first] = json::parse(el.second);
}
if (!inputs.json_schema.empty()) {
params.json_schema = json::parse(inputs.json_schema);
}
if (inputs.parallel_tool_calls && !tmpl.original_caps().supports_parallel_tool_calls) {
LOG_DBG("Disabling parallel_tool_calls because the template does not support it\n");
params.parallel_tool_calls = false;
} else {
params.parallel_tool_calls = inputs.parallel_tool_calls;
}
if (params.tools.is_array()) {
if (params.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE && !params.grammar.empty()) {
throw std::runtime_error("Cannot specify grammar with tools");
}
if (caps.supports_tool_calls && !caps.supports_tools) {
LOG_WRN("Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.\n");
}
}
// DeepSeek R1: use handler in all cases except json schema (thinking / tools).
if (src.find("<ο½toolβcallsβbeginο½>") != std::string::npos && params.json_schema.is_null()) {
return common_chat_params_init_deepseek_r1(tmpl, params);
}
// Command R7B: : use handler in all cases except json schema (thinking / tools).
if (src.find("<|END_THINKING|><|START_ACTION|>") != std::string::npos && params.json_schema.is_null()) {
return common_chat_params_init_command_r7b(tmpl, params);
}
// Hermes 2/3 Pro, Qwen 2.5 Instruct (w/ tools)
if (src.find("<tool_call>") != std::string::npos && params.json_schema.is_null()) {
return common_chat_params_init_hermes_2_pro(tmpl, params);
}
// Use generic handler when mixing tools + JSON schema.
// TODO: support that mix in handlers below.
if ((params.tools.is_array() && params.json_schema.is_object())) {
return common_chat_params_init_generic(tmpl, params);
}
// Functionary prepends "all\n" to plain content outputs, so we use its handler in all cases.
if (src.find(">>>all") != std::string::npos) {
return common_chat_params_init_functionary_v3_2(tmpl, params);
}
// Firefunction v2 requires datetime and functions in the context even w/o tools, so we also use its handler in all cases.
if (src.find(" functools[") != std::string::npos) {
return common_chat_params_init_firefunction_v2(tmpl, params);
}
// Functionary v3.1 (w/ tools)
if (src.find("<|start_header_id|>") != std::string::npos
&& src.find("<function=") != std::string::npos) {
return common_chat_params_init_functionary_v3_1_llama_3_1(tmpl, params);
}
// Llama 3.1, 3.2, 3.3 (also requires date_string so using it even w/o tools)
if (src.find("<|start_header_id|>ipython<|end_header_id|>") != std::string::npos) {
auto allow_python_tag_builtin_tools = src.find("<|python_tag|>") != std::string::npos;
return common_chat_params_init_llama_3_x(tmpl, params, allow_python_tag_builtin_tools);
}
// Plain handler (no tools)
if (params.tools.is_null() || inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_NONE) {
return common_chat_params_init_without_tools(tmpl, params);
}
// Mistral Nemo (w/ tools)
if (src.find("[TOOL_CALLS]") != std::string::npos) {
return common_chat_params_init_mistral_nemo(tmpl, params);
}
// Generic fallback
return common_chat_params_init_generic(tmpl, params);
}
So, if the Jinja chat template inside the GGUF is valid but not compatible with "Hermes 2 Pro," then the internal parsing in llama.cpp should be fixed.
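To see which handler a given template will hit without rebuilding llama.cpp, you can scan the template source for the same marker strings yourself. This is only a rough Python sketch mirroring the checks above (the marker list is copied from the snippet; the file name is whatever you pass to --chat-template-file), not an official tool:

# Predict which chat handler llama.cpp will pick for a template by searching its
# source for the same markers common_chat_templates_apply_jinja() checks above.
MARKERS = [
    ("<｜tool▁calls▁begin｜>", "DeepSeek R1"),
    ("<|END_THINKING|><|START_ACTION|>", "Command R7B"),
    ("<tool_call>", "Hermes 2 Pro"),
    (">>>all", "Functionary v3.2"),
    (" functools[", "Firefunction v2"),
    ("[TOOL_CALLS]", "Mistral Nemo"),
]

def guess_chat_format(template_path: str) -> str:
    src = open(template_path, encoding="utf-8").read()
    for marker, handler in MARKERS:
        if marker in src:
            return handler
    return "Generic / other fallback"

# The Unsloth Qwen3-Coder template contains <tool_call>, so this prints "Hermes 2 Pro".
print(guess_chat_format("chat_template.jinja"))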
Main issues with RooCode:
- Near infinite looping and going insanely slow.
- Getting the message: "Roo tried to use write_to_file without value for required parameter 'path'. Retrying..." (usually twice in a row, example below failed 3 times in a row)
- Getting the message: Roo is having trouble... This may indicate a failure in the model's thought process or inability to use a tool properly, which can be mitigated with some user guidance (e.g. "Try breaking down the task into smaller steps"). after an extended period of time of nothing happening.
One more quick demo before I head off to bed:
My prompt: "Analyze this project and write a summary in SUMMARY.md"
Main issues with Charm Crush:
- Limited tool calling.
- Complete crashing (whereas the Non Think model works near perfectly)
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL> request /v1/chat/completions - start: 0s, total: 1.368125017s
[INFO] Request ::1 "POST /v1/chat/completions HTTP/1.1" 200 1984 "OpenAI/Go 1.11.1" 1.368240367s
slot launch_slot_: id 0 | task 2247 | processing task
slot update_slots: id 0 | task 2247 | new prompt, n_ctx_slot = 196608, n_keep = 0, n_prompt_tokens = 10792
slot update_slots: id 0 | task 2247 | kv cache rm [3, end)
slot update_slots: id 0 | task 2247 | prompt processing progress, n_past = 2051, n_tokens = 2048, progress = 0.189770
slot update_slots: id 0 | task 2247 | kv cache rm [2051, end)
slot update_slots: id 0 | task 2247 | prompt processing progress, n_past = 4099, n_tokens = 2048, progress = 0.379540
slot update_slots: id 0 | task 2247 | kv cache rm [4099, end)
slot update_slots: id 0 | task 2247 | prompt processing progress, n_past = 6147, n_tokens = 2048, progress = 0.569311
slot update_slots: id 0 | task 2247 | kv cache rm [6147, end)
slot update_slots: id 0 | task 2247 | prompt processing progress, n_past = 8195, n_tokens = 2048, progress = 0.759081
slot update_slots: id 0 | task 2247 | kv cache rm [8195, end)
slot update_slots: id 0 | task 2247 | prompt processing progress, n_past = 10243, n_tokens = 2048, progress = 0.948851
slot update_slots: id 0 | task 2247 | kv cache rm [10243, end)
slot update_slots: id 0 | task 2247 | prompt processing progress, n_past = 10792, n_tokens = 549, progress = 0.999722
slot update_slots: id 0 | task 2247 | prompt done, n_past = 10792, n_tokens = 549
slot release: id 0 | task 2247 | stop processing: n_past = 10817, truncated = 0
slot print_timing: id 0 | task 2247 |
prompt eval time = 12346.87 ms / 10789 tokens ( 1.14 ms per token, 873.82 tokens per second)
eval time = 1304.98 ms / 26 tokens ( 50.19 ms per token, 19.92 tokens per second)
total time = 13651.85 ms / 10815 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL> request /v1/chat/completions - start: 0s, total: 15.020912495s
[INFO] Request ::1 "POST /v1/chat/completions HTTP/1.1" 200 5343 "OpenAI/Go 1.11.1" 15.021032385s
[DEBUG] Exclusive mode for group (default), stopping other process groups
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 2283 | processing task
slot update_slots: id 0 | task 2283 | new prompt, n_ctx_slot = 196608, n_keep = 0, n_prompt_tokens = 10998
slot update_slots: id 0 | task 2283 | kv cache rm [10792, end)
slot update_slots: id 0 | task 2283 | prompt processing progress, n_past = 10998, n_tokens = 206, progress = 0.018731
slot update_slots: id 0 | task 2283 | prompt done, n_past = 10998, n_tokens = 206
[New LWP 1031385]
[New LWP 1031384]
[New LWP 1031383]
[New LWP 1031382]
[New LWP 1031381]
[New LWP 1031380]
[New LWP 1031379]
[New LWP 1031378]
[New LWP 1031377]
[New LWP 1031376]
[New LWP 1031375]
[New LWP 1031374]
[New LWP 1031373]
[New LWP 1031372]
[New LWP 1031371]
[New LWP 1031361]
[New LWP 1031360]
[New LWP 1031359]
[New LWP 1031358]
[New LWP 1031357]
[New LWP 1031356]
[New LWP 1031355]
[New LWP 1031354]
[New LWP 1031353]
[New LWP 1031352]
[New LWP 1031351]
[New LWP 1031350]
[New LWP 1031349]
[New LWP 1031348]
[New LWP 1031347]
[New LWP 1031346]
[New LWP 1031345]
[New LWP 1031344]
[New LWP 1031343]
[New LWP 1031342]
[New LWP 1031341]
[New LWP 1031340]
[New LWP 1031339]
[New LWP 1031338]
[New LWP 1031337]
[New LWP 1031336]
[New LWP 1031335]
[New LWP 1031334]
[New LWP 1031333]
[New LWP 1031332]
[New LWP 1031331]
[New LWP 1031330]
[New LWP 1031329]
[New LWP 1031328]
[New LWP 1031327]
[New LWP 1031324]
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f66476b2742 in ?? () from /usr/lib/libc.so.6
#0 0x00007f66476b2742 in ?? () from /usr/lib/libc.so.6
#1 0x00007f66476a5eec in ?? () from /usr/lib/libc.so.6
#2 0x00007f664772825b in wait4 () from /usr/lib/libc.so.6
#3 0x00007f6647eadd3f in ggml_print_backtrace () from /usr/lib/libggml-base.so
#4 0x00007f6647ec36c0 in ?? () from /usr/lib/libggml-base.so
#5 0x00007f6647ab3c6c in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
warning: 48 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc: No such file or directory
#6 0x00007f6647a94644 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:102
102 in /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc
#7 0x00007f6647ab3f78 in __cxxabiv1::__cxa_throw (obj=0x55b5d7d936e0, tinfo=0x7f6647cd65e0 <typeinfo for std::runtime_error>, dest=0x7f6647ad67b0 <std::runtime_error::~runtime_error()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
warning: 98 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory
#8 0x000055b5c785ebcf in ?? ()
#9 0x000055b5c78baaae in ?? ()
#10 0x000055b5c78c4d22 in ?? ()
#11 0x000055b5c78cbed4 in ?? ()
#12 0x000055b5c78b5899 in ?? ()
#13 0x000055b5c788d3fe in ?? ()
#14 0x00007f6647627b8b in ?? () from /usr/lib/libc.so.6
#15 0x00007f6647627c3b in __libc_start_main () from /usr/lib/libc.so.6
#16 0x000055b5c7891195 in ?? ()
[Inferior 1 (process 1031322) detached]
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: 'I'll analyze this project and create a summary. Let me first check the README to understand what this project is about.
<tool_call>
<function=view' not found at start of 'I'll analyze this project and create a summary. Let me first check the README to understand what this project is about.
<tool_call>
'
[INFO] Request ::1 "POST /v1/chat/completions HTTP/1.1" 200 7660 "OpenAI/Go 1.11.1" 10.262253765s
[DEBUG] <Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL> cmd.Wait() returned error: signal: aborted (core dumped)
[WARN] <Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL> ExitError >> signal: aborted (core dumped), exit code: -1
[INFO] <Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL> process exited but not StateStopping, current state: ready
Actually called the "List" tool, but crashed right after.
The biggest core issue: Non Think model released on Monday does all things RooCode and tool calling significantly better, almost flawlessly, compared to Coder model released today, and it's not even close. If someone told me that there was an innocent mix up and the model released today was actually the Non Think model and the model released on Monday was actually the Coder model... I'd believe it. (Not necessarily from the programming output perspective, because I haven't had a chance to test today's model in actual coding because of the errors, but just from the interactions with all the coding tools and tool calling and how much better it is with Non Think.)
Have you rebuilt the model with the chat template fix? It's actually easier for me to just download the whole thing again - since ramalama runs the model in a container, I can't figure out how to get the updated chat template file into the container for it to use; I can't find any options to make it mount a file into the container. But I have a very fast internet connection, so I can redownload the model in about three minutes. :P
Just a +1 on the error/crash still happening using llama.cpp + sst/opencode with the new template file. It's a different error/crash from the "old" baked-in template.
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: 'I'll create a snake game in C using ncurses. Let me first check what files exist to understand the project structure.
<tool_call>
<function=list' not found at start of 'I'll create a snake game in C using ncurses. Let me first check what files exist to understand the project structure.
<tool_call>
'
EDIT: Using the template from unsloth/Qwen3-30B-A3B-Instruct-2507 seems to work for a few calls but then crashes with:
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: now finding less tool calls!
It seems at Ollama they accepted the reality for now lol https://ollama.com/library/qwen3-coder:30b/blobs/c6a614465b37
Maybe we should just wait for Qwen to communicate on what is happening?
I've just monkey-patched llama.cpp's Hermes 2 parsing source code to match the Jinja template inside the GGUF file (the original Jinja template from the GGUF file uploaded ~18 hours ago).
https://github.com/cinu/llama.cpp/commit/26aa6535656bb33cea2bb907567ebd80e3d3b069
It "works" - at least, it's not crashing. I'm testing it with simple tools and simple arguments, so it may break with more advanced definitions.
You can test it by cloning with:
git clone -b qwen3-coder https://github.com/cinu/llama.cpp
Or, just replace common_chat_params_init_hermes_2_pro()
and common_chat_parse_hermes_2_pro()
definitions in common/chat.cpp
with my fixed versions from: https://raw.githubusercontent.com/cinu/llama.cpp/refs/heads/qwen3-coder/common/chat.cpp
Please note that this is just a quick-and-dirty patch for debugging, not an official implementation.
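If you just want to see what the parser is being asked to handle without touching C++, here is a rough Python sketch (this is not the llama.cpp parser, and the sample output below is made up) that pulls the Qwen3-Coder style <tool_call><function=...><parameter=...> blocks out of raw model output and converts them into OpenAI-style tool calls:

import json
import re

# Hypothetical helper: extract <tool_call><function=...><parameter=...> blocks
# and turn them into OpenAI-style tool_call dicts.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>\n(.*?)\n</parameter>", re.DOTALL)

def extract_tool_calls(text: str):
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        args = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"type": "function",
                      "function": {"name": name, "arguments": json.dumps(args)}})
    return calls

sample = """I'll check the README first.
<tool_call>
<function=view>
<parameter=path>
README.md
</parameter>
</function>
</tool_call>"""
print(extract_tool_calls(sample))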
I've just monkey-patched llama.cpp's Hermes 2 parsing source code to match the Jinja template inside the GGUF file (the original Jinja template from the GGUF file uploaded ~18 hours ago).
https://github.com/cinu/llama.cpp/commit/26aa6535656bb33cea2bb907567ebd80e3d3b069
It "works" - at least, it's not crashing. I'm testing it with simple tools and simple arguments, so it may break with more advanced definitions.
You can test it by cloning with:
git clone -b qwen3-coder https://github.com/cinu/llama.cpp
Or, just replace
common_chat_params_init_hermes_2_pro()
and common_chat_parse_hermes_2_pro()
definitions in common/chat.cpp
with my fixed versions from: https://raw.githubusercontent.com/cinu/llama.cpp/refs/heads/qwen3-coder/common/chat.cpp
Please note that this is just a quick-and-dirty patch for debugging, not an official implementation.
Did you open an issue on the llama.cpp GitHub? If not, can you kindly do that?
It seems at Ollama they accepted the reality for now lol https://ollama.com/library/qwen3-coder:30b/blobs/c6a614465b37
Maybe we should just wait for Qwen to communicate on what is happening?
Actually, the simple template they set, without tools, works pretty great with Roocode! It fails from time to time but works most of the time.
It seems at Ollama they accepted the reality for now lol https://ollama.com/library/qwen3-coder:30b/blobs/c6a614465b37
Can also confirm that this Ollama template is working way better.
RooCode for my prompt:
Analyze this project and write a summary in SUMMARY.md
Actually did multiple read file operations at the same time which I don't think I've ever seen before on RooCode.
Charm Crush also working a bit better for the same prompt but with issues still:
Sadly, it still ended in an infinite loop, but at least it did some things:
Might have spoken too soon about the Ollama template. After a bit more testing with the Ollama jinja template, I'm still seeing a lot of infinite looping even in RooCode (and saw it above in Crush).
Still way worse than the Non Think model (or at least the Non Think model's template if that's relevant...).
Back to that "Roo tried to use write_to_file without value for required parameter 'path'. Retrying..."
Stuck looping for another 10+ minutes
More errors...
I'm 100% assuming this is going to end with the "Roo is having trouble...
This may indicate a failure in the model's thought process or inability to use a tool properly, which can be mitigated with some user guidance (e.g. "Try breaking down the task into smaller steps")."
Update: Another 10+ minutes later and it's still thinking/looping, so I'm just going to kill it.
Last Update: Right before I clicked cancel, it fully errored out:
Is there a way to view the jinja template for the Non Think model released earlier this week?
I honestly just want to try and test it with that and see what happens (because I'm still so very impressed with how the Non Think model is performing in RooCode and Crush). Maybe this could help narrow down if the template is the issue or not?
I think I found the jinja template for the Non Think model (please let me know if this is not it): https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF/blob/main/template
RooCode Query: "Analyze this project and write a summary in SUMMARY.md"
Got this right.
RooCode Query: "Create a rampart game in the rampart.py file using Pygame. Only read, reference, and use the rampart.py file."
Sadly it's failing. Not using tools and writing everything in the chat window.
For Charm Crush the query: "Analyze this project and write a summary in SUMMARY.md"
Calling lots of tools
Almost to the finish line. Now doing a lot of stuff with git...
Sadly... infinite loop at the end.
--
I know these tests aren't exactly scientific (assuming that the Jinja template above is the actual one from the Non Think model), but to me it just seems that the Coder model is simply not performing as well as the Non Think model at this point.
Unless there's some incredible magic sauce missing in this jinja template stuff to make it magically come together perfectly... it's looking more and more like Coder is a tool-calling and coding regression from Non Think model (at least in terms of RooCode + Crush + OpenCode integration). It's really hard to believe, but I'm starting to lose hope.
Maybe it works great in the Qwen Coder tool they released (I haven't tested it) and maybe they potentially overfocused on that? Not sure. Hopefully someone from Alibaba / Qwen team could provide guidance here on why Non Think model does a better job with coding tools than Coder model.
In the meantime, I would actually recommend people play around with Non Think (Qwen3-30B-A3B-Instruct-2507-GGUF) model. It works really well...
Actually completed the job in Crush without erroring out or looping, and works way faster than Coder at the same quants.
I guess we'll just have to wait until llama.cpp implements a proper qwen3-coder parser or the Qwen team switches back to a known tool calling format; see https://github.com/ggml-org/llama.cpp/issues/15012 and https://github.com/ggml-org/llama.cpp/issues/14915
Sorry, back! Yes, I've been discussing this on the llama.cpp side as well - the new chat template at https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja removes .keys(),
so it should be equivalent to https://github.com/ggml-org/llama.cpp/issues/15012. However, https://github.com/ggml-org/llama.cpp/issues/14915 should be added since that parses the XML for Coder much better!
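If you want to sanity-check the updated template locally before pointing llama-server at it, a plain jinja2 render against a minimal tool definition is enough to catch things like the old .keys() usage. This is just a sketch: jinja2 (3.1+ is needed for the items filter) is more permissive than llama.cpp's minja, so a clean render here doesn't guarantee minja accepts it, and the write_to_file tool below is only a stand-in:

import json
import urllib.request
from jinja2 import Environment  # jinja2 >= 3.1 for the `items` filter

# Render the updated template against one small tool and a single user message.
URL = "https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja"
template_src = urllib.request.urlopen(URL).read().decode("utf-8")
template = Environment().from_string(template_src)

tools = [{"type": "function", "function": {
    "name": "write_to_file",
    "description": "Write content to a file",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Target path"},
            "content": {"type": "string", "description": "File contents"},
        },
        "required": ["path", "content"],
    },
}}]
messages = [{"role": "user", "content": "Analyze this project and write a summary in SUMMARY.md"}]

print(template.render(messages=messages, tools=tools, add_generation_prompt=True))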
@JamesMowery Thanks for the additional tests! It's possible the tool calling dynamics are different for Qwen3 Coder, so it might be related, but I'm unsure as of yet
Someone made a llama.cpp PR with a potential qwen3-coder parser @ https://github.com/ggml-org/llama.cpp/pull/15019. Haven't tried it myself yet but seems promising.
@danielhanchen the consensus at https://github.com/ggml-org/llama.cpp/pull/15019 seems to be that all GGUF quants are broken, FYI. Tried a few from different people and they all break. The parser from that PR seems to be working though.
Tried llama.cpp with PR https://github.com/ggml-org/llama.cpp/pull/14962
and chat template:
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja
Tool calling for both Cline and the qwen-coder CLI is working well.
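For anyone who wants to reproduce this without Cline or RooCode in the middle, hitting llama-server's OpenAI-compatible endpoint directly with a single tool is usually enough to see whether the call gets parsed or the server falls over. A minimal sketch (the get_weather tool is made up for the test, and the port should match whatever you started llama-server with):

import json
import urllib.request

payload = {
    "model": "unsloth/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Paris right now?"}],
    "tools": [{"type": "function", "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }}],
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    message = json.load(resp)["choices"][0]["message"]

# A healthy setup returns structured tool_calls; a broken parse leaves the raw
# <tool_call> text in content (or the server aborts, as in the logs above).
print(json.dumps(message, indent=2))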
Can confirm that https://github.com/ggml-org/llama.cpp/pull/14962 (now merged in master) makes everything work but the model still behaves weird. Like printing out the tool calls, switching randomly between JSON and XML, etc.
Got the new llamacpp update (b6067) with the fixes! Testing with the different templates I've seen listed in the various discussions here. Figured I'd test it live and document the results.
TL;DR to save you time: Things are still bad (and even worse in terms of speed) with the newest update.
Here's the settings I ran with:
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q5_K_XL.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 30
--ctx-size 196608
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file <TEMPLATE PATH HERE>
--jinja
ttl: 120
For reference, the two prompts I used (not saying these are good prompts, but it maintains consistency with all my prior testing):
- First Prompt: "Analyze this project and write a summary in SUMMARY.md"
- Second Prompt: "Create a rampart game in the rampart.py file using Pygame. Only read, reference, and use the rampart.py file."
Alt Template (forgot who posted this in one of the prior discussions)
{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
...
...
...
(Cut to save character limit)
First Prompt Results:
Errored. I gave up here and didn't bother continuing testing with the second prompt.
Ollama Template
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
First Prompt Results:
Success in making the summary; didn't use todo list but maybe was not needed.
Second Prompt Results:
Infinite looping behavior still early on.
Not looking good after 10 minutes of running. Assuming it will be a complete fail.
Complete fail. Gave up on further testing.
Unsloth Template: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja
First Prompt Results:
Success in making the summary; didn't use todo list but maybe was not needed.
Second Prompt Results:
Off to a good start it seems, but can it get past the first hurdle of actually creating the file?
Unfortunately my hopes were too high. Infinite looping right after the first step for about 3 - 4 minutes. Eventually it showed the first error.
It's asking me to run a different command to read an empty file.
Very odd behavior. I'm assuming the tool calls to read the file are just failing and it's asking me to run the commands manually.
It's doing something weird in the chat where it is writing the code/diff in the chat and not writing it normally in the code editor. Running EXTREMELY slowly... worst speeds yet. (Been at this for like 15 minutes, whereas Non Think knocks this out in about 3 - 4 minutes worst case scenario).
It gave up on writing many minutes in. Now asking me to echo the file out manually...
Now it's trying to write the code through the python interpreter directly?
It's way too weird. Decided to end it here.
UPDATE: Decided to run the very last test again just in case maybe it was an RNG thing ... still getting errors right off the start:
Thoughts after another round of testing with the latest llamacpp version:
- The Qwen3 Non Think model handled all of these prompts without fail and many times faster. The Qwen3 Coder model is just not working correctly still... for whatever reason that might be. And now it's running a bit slower than before for me with this new update (assuming because of the infinite looping).
- Maybe there's an issue with my llamacpp setup specifically for Coder (I posted it above if anyone wants to make a suggestion)? I confirmed the templates were being used in the logs though. Still, the Qwen3 Non Think model gets it perfectly, one-shot, with similar settings (just no alternative template).
There are 49 layers in this model; surely it will be slow if you only offload 30 layers to the GPU.
Just tried the 2 prompts and they're working well with my llama.cpp + Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf + Cline setup.
SET LLAMA_CPP_PATH=G:\ai\llama.cpp\build\bin\Release
SET LLAMA_ARG_HOST=0.0.0.0
SET LLAMA_ARG_PORT=8080
SET LLAMA_ARG_JINJA=true
SET LLAMA_ARG_FLASH_ATTN=true
SET LLAMA_ARG_CACHE_TYPE_K=q8_0
SET LLAMA_ARG_CACHE_TYPE_V=q8_0
SET LLAMA_ARG_N_GPU_LAYERS=999
SET LLAMA_ARG_NO_MMAP=1
SET LLAMA_ARG_CTX_SIZE=200000
SET LLAMA_ARG_MODEL=Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
cd G:\ai\models\unsloth\Qwen3-Coder-30B-A3B-Instruct-GGUF
llama-server.exe ^
--temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --repeat-penalty 1.05 ^
--swa-full ^
--jinja ^
--chat-template-file G:\ai\models\Qwen3-Coder-30B-A3B.chat.txt
My concern isn't the speed though. I'm just giving feedback that now it's SLOWER than before.
The problem that should be the focus is that the output is the same, if not worse. It's still not calling tools.
To be clear, I've also tested lower quants with full offloading to GPU throughout this process. No change (obviously it's faster, but I don't care about the speed).
I'll try to see if I can match your settings though and see what happens!
Deleting this as I was using the wrong template for this specific test. Rerunning tests.
New settings:
"Qwen3-Coder-30B-A3B-Instruct-Q4KM":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
--port ${PORT}
--flash-attn
--gpu-layers 999
--ctx-size 200000
--temp 0.7
--no-mmap
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/unsloth_template.jinja
--jinja
--swa-full
Okay so I'm noticing something maybe important here:
If I run with --gpu-layers 25, I get this... lots of errors as usual, almost immediately.
If I change it to --gpu-layers 999 (to run it entirely on the GPU), it works a bit better in theory, but I'm still getting the infinite loops (stuck here for ~10 minutes so far). Eventually got the error.
Why would --gpu-layers impact the output here (both are still failures, but they're different)? I thought this only impacted speed? Or is this just down to RNG at this point?
Still not getting a positive outcome regardless, even when matching kiuckhuang's settings as best I can.
@kiuckhuang To confirm, is this the template you are using? https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja
I've noticed that when I use @kiukhuang's settings I get the following error with the chat template, no matter what I try:
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262144
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected value expression at row 1, column 4:
{{- $lastUserIdx := -1 -}}
^
{{- range $idx, $msg := .Messages -}}
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 262144
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
{{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:5826 - starting the main loop
No idea if someone can help me figure out what's wrong here. Here's the new settings I'm trying:
"Qwen3-Coder-30B-A3B-Instruct-Q4KM":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--gpu-layers 999
--ctx-size 200000
--temp 0.7
--no-mmap
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--swa-full
--chat-template-file /mnt/big/AI/models/unsloth_template.jinja
--jinja
ttl: 120
Does the order of the parameters matter maybe? Does that --swa-full
mess things up perhaps?
And can someone confirm the exact chat template we should be using as of right now?
@JamesMowery
nah, offloading layers to GPU shouldn't affect the accuracy at all since the exact same operations are done
But try removing
--cache-type-k q8_0
--cache-type-v q8_0
I know it can have an impact on small models (I've already run into repetition and so on with q8_0), and as this one is a MoE, I suspect this could really make things worse, but I'll let you try ;)
flash attn can also have an impact on output, but I think the main culprit is the above
PS: I really don't know what swa-full is doing!
@JamesMowery
yes, I am using https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja
And I've only tested tool calling with Cline + the Qwen Coder CLI; I haven't used RooCode so far.
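Side note, in case it helps anyone compare setups: you can dump the template that's actually baked into a quant and diff it against the file you pass with --chat-template-file. This is a sketch using the gguf Python package from llama.cpp's gguf-py; the string-field access follows its gguf_dump script, so treat the details as an assumption if your version differs:

from gguf import GGUFReader  # pip install gguf

GGUF_PATH = "/mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf"
TEMPLATE_PATH = "/mnt/big/AI/models/unsloth_template.jinja"

reader = GGUFReader(GGUF_PATH)
field = reader.fields["tokenizer.chat_template"]
baked_in = bytes(field.parts[-1]).decode("utf-8")  # last part holds the string value

with open(TEMPLATE_PATH, encoding="utf-8") as f:
    on_disk = f.read()

print("identical" if baked_in.strip() == on_disk.strip() else "templates differ")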
Update: I might have figured out what the problem is... I think?
I accidentally duplicated my Q4 UD quant entry in the llama-swap config (when I was trying to copy in the new settings), and I think that made everything freak out. It was seemingly interfering not only with the performance of the Q4 quant, but also with the Q5 and Q6 quants, for reasons I don't understand.
I got that sorted. I'm rerunning the tests right now.
That's still bad for me too... As bold84 said there https://github.com/ggml-org/llama.cpp/pull/15019#issuecomment-3146502618
The more stuff is in the context window, the less likely tool calls succeed.
But one thing is sure, the 30B model isn't strong with tool calls when there's 30k+ tokens in the context window.
I guess we are all facing the same thing. @JamesMowery just in case: I don't know if you've tried disabling the MCP functionality in Roo, but it can at least save some tokens (not that many), and maybe make the LM less confused?
Devstral is slower, but still my main for agentic coding :)
Rerunning the tests after fixing this tiny error in my llama-swap. Recreated it from scratch just to be extra sure.
I also double checked to confirm all the models load the latest chat template without error, which they did.
Setup for each
"Qwen3-Coder-30B-A3B-Instruct-UD-Q4KXL":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 30
--ctx-size 196608
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/unsloth_template.jinja
--jinja
ttl: 120
"Qwen3-Coder-30B-A3B-Instruct-UD-Q5KXL":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q5_K_XL.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 30
--ctx-size 196608
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/unsloth_template.jinja
--jinja
ttl: 120
"Qwen3-Coder-30B-A3B-Instruct-UD-Q6KXL":
cmd: |
llama-server
-m /mnt/big/AI/models/llamacpp/Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf
--alias "unsloth/Qwen3-Coder-30B-A3B-Instruct"
--port ${PORT}
--flash-attn
--threads 16
--gpu-layers 30
--ctx-size 65536
--temp 0.7
--top-k 20
--top-p 0.8
--min-p 0.0
--repeat-penalty 1.05
--cache-type-k q8_0
--cache-type-v q8_0
--chat-template-file /mnt/big/AI/models/unsloth_template.jinja
--jinja
ttl: 120
- Using the https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/raw/main/chat_template.jinja:
- First Prompt: "Analyze this project and write a summary in SUMMARY.md"
- Second Prompt: "Create a rampart game in the rampart.py file using Pygame. Only read, reference, and use the rampart.py file."
Q4 UD Quant
Prompt 1:
Success on making the summary
Prompt 2:
Success on creating the game
Q5 UD Quant
Prompt 1:
Failed:
Prompt 2:
Failed (not using any tools and writing all the code in the chat):
Q6 UD Quant
Prompt 1:
Failed:
Prompt 2:
Failed (doing the same old thing where it's going to error out after about 10 more minutes, and it's looping over and over):
Yeah the latest llama.cpp build with that PR included and the fixed template from unsloth "works" but the model/quants are a bit meh right now. Non-coding non-thinking 2507 30B felt a lot better in my very simple testing. Hoping for a significant fix from Unsloth and/or Qwen.
Decided to re-run the exact same test on the Q4 quant to see if I could replicate the success (as it was the first time ever, wanted to confirm it's not a fluke).
It's going psycho on the second prompt and is stuck in a loop of... not calling the tools, but thinking it is, I think:
It's still just worse than Non Thinking, as marceldev89 mentioned. I'll wait for any major updates and re-run the tests again if we get something new and exciting.
For now ... :(
(But big shout out to Unsloth and Qwen for the earlier releases. Non Think is absolutely amazing!)
@JamesMowery
You are going a bit fast! Did you try removing --cache-type-k q8_0
and --cache-type-v q8_0
after you sorted out your template issue? Because in the commands you shared they're there again.
@owao Happy to give it a shot. Although I've had zero issues with Non Think with the same exact setup. Not really expecting this to be a make or break situation, but I'll give it one more shot.
(I'm also in the process of downloading Qwen-Code, so I'll have a play around with that.)
Yeah, I agree that would be inconsistent, you're right; it's just that things aren't that bad on my side... until I reach those 30K+ tokens!
Testing with the cache-type params removed.
Q4 UD Quant
Prompt 1: Success
Prompt 2: Failed (infinite looping; since it's running way way slower, not going to wait for the full failure loop)
Q5 UD Quant
Prompt 1: Failed
Prompt 2: Not running since the first prompt is a way simpler task.
Q6 UD Quant
Prompt 1: Failed
Prompt 2: Not running since the first prompt is a way simpler task.
Sadly, still very poor performance without the cache-type params set.
--
I'm going to play around with Qwen-Code (hopefully you can use it with the local model) and see if that gives any better results.
Testing in qwen-code... hoping for a better outcome!
Prompt 1:
Not sure if I'm supposed to be seeing that HTML-like and JSON code.
Keeps getting worse...
Seemed like it was maybe going to do it... but eventually gave up and failed.
Darn, I was really hoping qwen-code would at least give me some hope. But if it's not working in qwen-code, then what are the odds it's ever going to work outside of it?
Out of curiosity, I had to try Qwen3 Non Thinking Model in qwen-coder:
Prompt 2:
And...
Perfect!
It's not rampart at all (probably referenced the Flappy Bird game I had in the directory), but at least it's a game and it works and has an end screen?
Just absolutely destroys Qwen3 Coder with everything relating to tool calling.
Someone please hack the code to inject the coding knowledge of Qwen3 Coder into the brains of Qwen3 Non Think and call it a job well done! :)
At this moment, how to fix this?
Latest llama.cpp build, latest unsloth quant gguf. Error from first message. =(
At this moment, how to fix this?
Latest llama.cpp build, latest unsloth quant gguf. Error from first message. =(
Make sure to also use the updated template from https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507/blob/main/chat_template.jinja.
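If you just want that single template file without cloning the repo, huggingface_hub can fetch it and you can pass the resulting path to llama-server via --jinja --chat-template-file. A small sketch using the repo linked above:

from huggingface_hub import hf_hub_download

# Downloads only chat_template.jinja from the Instruct-2507 repo referenced above.
template_path = hf_hub_download(
    repo_id="unsloth/Qwen3-30B-A3B-Instruct-2507",
    filename="chat_template.jinja",
)
print(template_path)  # pass this path to: llama-server ... --jinja --chat-template-file <path>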
Thank you!
Could you provide an example where this is failing - that would be very helpful, thank you!
Is there any work being done by the Qwen team or you guys to improve the tool calling issues, or is it expected that we download the GGUF from here, get the jinja from the instruct version (https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507/blob/main/chat_template.jinja), and use Qwen Coder only with that?