Qwen3 not Using Tools in Complex Prompts Unlike QwQ-32B

#20
by Anaudia - opened

I previously used QwQ-32B via Qwen-Agent and everything ran smoothly. However, when I use the same prompt and setup with Qwen3-32B or Qwen3-14B, the model often responds that it cannot use any tools and will instead make an educated guess.

This behavior may be related to my prompting approach. I use a large system prompt and typically work with relatively long input texts (around 8k tokens). It would be very helpful to know what setup was used for the BFCL benchmark or, more generally, during training for tool usage.

Could you share any best practices for prompting in tool-use scenarios beyond the very simple example currently provided? I imagine many users are looking to apply this model in more complex tool-using contexts, especially since it is specifically promoted for that purpose.

Thank you for your feedback. If you're open to sharing your prompt (or a simplified version), we'd be happy to review it and provide targeted suggestions for improving tool activation in Qwen3-14B/32B.
We’re also working on a cookbook outlining best practices for tool calling in Qwen3, which we plan to release soon.

To better understand and address your problem: are you currently using the default settings for tool calling via Qwen-Agent, and are you running in thinking or non-thinking mode?

Thank you so much for your response; I really appreciate you taking the time!

Implementation Details

My implementation is based on the example shared for QwQ 32B on the Qwen-Agent GitHub: assistant_qwq.py.

I use vLLM for inference with the following command:

vllm serve /models/Qwen3-14B \
  --port 8000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1

So I believe I'm running in think mode; it's enabled via `--enable-reasoning`, right?
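As far as I understand, `--enable-reasoning --reasoning-parser` only tells the server how to split out the reasoning block; Qwen3 thinks by default, and the toggle itself lives in the chat template. A minimal sketch (model path and sampling values are illustrative) of building a per-request payload for the OpenAI-compatible API, assuming vLLM forwards `chat_template_kwargs`:

```python
# Sketch: per-request thinking toggle for Qwen3 behind vLLM's OpenAI-compatible
# server. The model path is a placeholder; pass the dict to
# client.chat.completions.create(**req) with an openai client.
def build_request(messages, enable_thinking=True):
    """Assemble request kwargs, including the Qwen3 chat-template toggle."""
    return {
        "model": "/models/Qwen3-14B",  # placeholder path
        "messages": messages,
        "temperature": 0.6,
        "top_p": 0.95,
        "extra_body": {
            "top_k": 20,
            # Qwen3's chat template reads this flag to enable/disable thinking.
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
        },
    }

req = build_request([{"role": "user", "content": "Hello"}], enable_thinking=False)
```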


I work on medical coding. Below you can find the system and user prompts I use, as well as the custom tool.

System Prompt

Your task is to analyze medical discharge reports, identify the primary diagnosis, and code it according to ICD-10-GM guidelines.

## ICD-10-GM Guidelines for Determining the Primary Diagnosis

The primary diagnosis follows the World Health Organization (WHO) definition:

> “The condition that is established at the end of the hospital stay as the diagnosis and was the main reason for the treatment and examination of the patient.”

At discharge, identify the condition that should be considered the primary diagnosis. The diagnosis mentioned first in the report is not necessarily the primary diagnosis.

### Selection When Multiple Diagnoses Qualify

If multiple conditions meet the primary diagnosis definition, select the one requiring the most medical resources. Resource consumption is determined by medical services (physician and nursing services, surgeries, medical products, etc.) rather than the cost weight of the case.

## Examples to Guide Your Analysis

**Example 1**  
A female patient is admitted for a keratoplasty and undergoes surgery. On the second day, she is transferred to the ICU due to a heart attack, and a coronary angiography with stent placement is performed.  
→ **Primary diagnosis:** Acute myocardial infarction.

**Example 2**  
A patient with decompensated heart failure due to a pre-existing atrial septal defect and chronic venous insufficiency of the lower extremities with ulceration is admitted. Heart failure is treated, VAC therapy is performed on the legs, and in the second week a percutaneous ASD closure with an Amplatzer device is performed.  
→ **Primary diagnosis:** Atrial septal defect (ASD).

**Example 3**  
A patient is hospitalized for 12 days to treat uncontrolled diabetes mellitus. One day before discharge, a phimosis surgery is performed.  
→ **Primary diagnosis:** Uncontrolled diabetes mellitus.

**Example 4**  
A patient is hospitalized for a bleeding gastric ulcer. Endoscopic hemostasis is performed and two units of erythrocyte concentrate are transfused.  
→ **Primary diagnosis:** Bleeding gastric ulcer.

**Example 5 – Psychiatry**  
A patient is admitted for a severe depressive episode. During treatment, harmful alcohol use is noted, and diabetes mellitus is diagnosed and managed.  
→ **Primary diagnosis:** Severe depressive episode.

## The ICD-10-GM Systematic Directory

**Important:** The ICD-10-GM codes for 2025 differ substantially from previous ICD editions.

### Chapter 1: Certain Infectious and Parasitic Diseases
- **1.1**: Infectious intestinal diseases  
- **1.2**: Tuberculosis  
- **1.3**: Certain bacterial zoonoses  
- **1.4**: Other bacterial diseases  
- **1.5**: Infections predominantly transmitted by sexual contact  
- **1.6**: Other spirochetal diseases  
- **1.7**: Other diseases caused by chlamydiae  
- **1.8**: Rickettsioses  
- **1.9**: Viral infections of the central nervous system  
- **1.10**: Arthropod-borne viral diseases and viral hemorrhagic fevers  
- **1.11**: Viral infections characterized by skin and mucous membrane lesions  
- **1.12**: Viral hepatitis  
- **1.13**: HIV disease [Human immunodeficiency virus disease]  
- **1.14**: Other viral diseases  
- **1.15**: Mycoses  
- **1.16**: Protozoan diseases  
- **1.17**: Helminthiases  
- **1.18**: Pediculosis [lice infestation], acariasis [mite infestation], and other parasitic infestations of the skin  
- **1.19**: Sequelae of infectious and parasitic diseases  
- **1.20**: Bacterial, viral and other infectious agents as the cause of diseases classified elsewhere  
- **1.21**: Other infectious diseases  

(truncated...)

### Chapter 21: Factors Influencing Health Status and Contact with Health Services
- **21.1**: Persons encountering health services for examination and investigation  
- **21.2**: Persons with potential health hazards related to communicable diseases  
- **21.3**: Persons encountering health services in connection with reproduction  
- **21.4**: Persons encountering health services for specific procedures and healthcare  
- **21.5**: Persons with potential health hazards related to socioeconomic and psychosocial circumstances  
- **21.6**: Persons encountering health services for other reasons  
- **21.7**: Persons with potential health hazards related to family and personal history and certain conditions influencing health status  

### Chapter 22: Codes for Special Purposes
- **22.1**: Provisional assignments for diseases of uncertain etiology, assigned and unassigned codes  
- **22.2**: Impairment of function  
- **22.3**: Completed registration for organ transplantation  
- **22.4**: Staging of HIV infection  
- **22.5**: Secondary codes to specify cytogenetic and molecular genetic differentiation in neoplasms  
- **22.6**: Secondary codes to specify mental and behavioural disorders  
- **22.7**: Other secondary codes for special purposes  
- **22.8**: Infectious agents resistant to certain antibiotics or chemotherapeutic agents  
- **22.9**: Assigned and unassigned codes  

User Prompt

This is the final discharge report that you are supposed to analyze and for which you are to identify the primary diagnosis:

'${abschlussbericht}'

## Output Format

Please reason step by step, and present your final primary code within \\boxed{}.
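(Side note: with this output format, the final code can be pulled out of the model's answer with a small regex. This is just an illustrative sketch, not part of the prompt itself.)

```python
import re

def extract_boxed(answer: str):
    """Return the content of the last \\boxed{...} in the answer, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", answer)
    return matches[-1] if matches else None

print(extract_boxed(r"... therefore \boxed{I21.0}"))  # → I21.0
```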

Custom Tool

import pprint
import urllib.parse
import json5
from qwen_agent.agents import Assistant
from qwen_agent.tools.base import BaseTool, register_tool
from qwen_agent.utils.output_beautify import typewriter_print
import re

@register_tool('retrieve_icd_chapter')
class RetrieveICDChapter(BaseTool):
    description = (
        'ICD-10-GM Coding Tool: input the subchapter you are interested in, '
        'and the tool returns the content of that chapter including all available codes.'
    )
    parameters = [{
        'name': 'subchapter',
        'type': 'string',
        'description': (
            'The number of the subchapter you are interested in, '
            'e.g., "1.1".'
        ),
        'required': True
    }]

    def call(self, params: str, **kwargs) -> str:
        subchapter_arg = json5.loads(params)['subchapter']
        # Raw string, so single backslashes: matches e.g. "1.1" or "21.7".
        pattern = r"\b\d{1,2}\.\d{1,2}\b"
        match = re.search(pattern, subchapter_arg)
        if not match:
            return "No matching chapter found."
        chapter_found = match.group(0)
        # code_range_mapping and sorted_combined_dict_translated_v2 are defined
        # elsewhere in the script.
        code_found = code_range_mapping.get(chapter_found)
        if code_found is None:
            return "No matching chapter found."
        dict_result_chapter = sorted_combined_dict_translated_v2.get(code_found.lower())
        return str(dict_result_chapter)
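A quick standalone check of the subchapter pattern (written with single backslashes in a raw string, which is what `re` needs; a double-escaped `r"\\b\\d..."` would match literal backslashes and never fire):

```python
import re

# Matches subchapter numbers like "1.1" or "21.7" in the tool's argument.
pattern = r"\b\d{1,2}\.\d{1,2}\b"

def extract_subchapter(text: str):
    """Return the first subchapter number found in the text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(extract_subchapter('{"subchapter": "1.1"}'))  # → 1.1
print(extract_subchapter("please check 21.7"))      # → 21.7
print(extract_subchapter("no chapter here"))        # → None
```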

Remark: In my previous attempts, I also tried moving all the additional tool-related information from my current system prompt into the tool’s description, which helped slightly. However, in around 70% of cases the model still ignores the tool and answers directly.

Hello, I have created a sample 'abschlussbericht' (discharge report) and run this script with the Qwen3-32B model, which correctly initiates tool calls.

I noticed that you did not pass in the sampling parameters. You can try the parameters we recommend for thinking mode: Temperature=0.6, TopP=0.95, TopK=20.

Hello, and thank you again for the feedback. The model does use the tool occasionally, but it's much less reliable than QwQ-32B, even when using the same script and prompts. I'm already using the parameters you suggested (see code snippet below).

I've also noticed that the model almost never queries the tool more than once. It either accepts one of the initially retrieved codes or argues that a different code is correct, but it doesn't validate that choice by calling the tool again. This happens even when I explicitly instruct the model to use the tool multiple times (e.g., by adding the following to the user or system prompt: "Remember to use the retrieve_icd_chapter tool frequently to check all subchapters that might be relevant before you decide on a primary diagnosis code."). Any ideas on how to improve performance or align my task more closely with the way the model was trained?

def init_agent_service():
    llm_cfg = {
        # Use a model service compatible with the OpenAI API, such as vLLM or Ollama:
        'model': '/donnees/models/Qwen3-32B',  # or '/donnees/models/QwQ-32B-AWQ', '/donnees/models/Qwen3-14B'; use the --served-model-name
        'model_server': 'http://localhost:8000/v1',  # or the server's IP if connecting remotely
        'api_key': 'EMPTY',

        # (Optional) LLM hyperparameters for generation:
        'generate_cfg': {
            'fncall_prompt_type': 'nous',
            # 'thought_in_content': True,
            'temperature': 0.6,
            'repetition_penalty': 1.0,
            'top_k': 20,
            'top_p': 0.95,
            # 'max_tokens': 32768,
        },
    }

    tools = ['retrieve_icd_chapter']
    bot = Assistant(
        llm=llm_cfg,
        function_list=tools,
        name='Medical Coding Assistant',
        system_message=system_instruction,  # system_instruction holds the system prompt above
    )
    return bot
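On the multi-turn point, one workaround I'm experimenting with is to check whether the run actually used the tool and, if not, append a reminder turn before re-running. This is my own sketch, assuming Qwen-Agent's message format where assistant turns carry a `function_call` dict; the reminder text is illustrative:

```python
TOOL_NAME = 'retrieve_icd_chapter'

def used_tool(messages, tool_name=TOOL_NAME):
    """True if any assistant turn in the message list called the tool."""
    return any(
        m.get('role') == 'assistant'
        and (m.get('function_call') or {}).get('name') == tool_name
        for m in messages
    )

def reminder_turn():
    """Extra user turn nudging the model back to the tool."""
    return {
        'role': 'user',
        'content': ('You have not called retrieve_icd_chapter yet. '
                    'Query every potentially relevant subchapter before '
                    'committing to a primary diagnosis code.'),
    }
```

In the run loop, one would collect the output of `bot.run(messages)`, and if `used_tool(...)` is False, append `reminder_turn()` and run again.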

Not relevant to this case, but I also found that the Qwen3-30B-A3B GGUF-quantized model tends to ignore tools; I am not sure whether it's a quantization problem or a model issue. The quantized Qwen3-32B model works as expected when dealing with tools.

A large custom system prompt like this, paired with tool-calling functionality, can sometimes cause complexity; it might help to place the instructions at the start of your user prompt instead. If the problem remains, try shortening the message and revising the tool description to pinpoint the issue, then gradually expand it again. Hope this helps!
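To illustrate the suggestion above, a minimal sketch of folding the long instructions into the user turn rather than a separate system message (`system_instruction` and `report` are placeholders):

```python
# Sketch: the coding guidelines and the discharge report share one user turn,
# leaving the system message free (or minimal) for the tool-calling scaffold.
system_instruction = "Your task is to analyze medical discharge reports ..."
report = "Patient admitted with ..."

messages = [{
    'role': 'user',
    'content': f"{system_instruction}\n\n## Discharge Report\n\n{report}",
}]
```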

This issue needs to be fixed ASAP. Someone needs to submit a PR to solve it.

Thank you again, Yang Su, for your response! I dug deeper and ran a few hundred test cases in two settings: single-turn and multi-turn tool usage.

Single-turn (complex system prompt):

  • QwQ-32B invoked the tool correctly in 81% of cases
  • Qwen3 succeeded in only 69% of cases

Multi-turn (most of these questions required three or more tool calls to solve):

  • QwQ-32B built its final answer on the tool’s output 76% of the time
  • Qwen3 did so only 47% of the time

In short, I have the feeling that QwQ-32B follows complex instructions far more reliably than Qwen3. When a task demands repeated tool use, Qwen3 often drifts from the prompt, truncating or skipping essential steps, whereas QwQ-32B stays on track. Again, maybe my prompts are not aligned with the way the model was trained and tested, but based on your feedback and what I found online I don't see a clear way to change that.
