Disentangling Reasoning and Knowledge in Medical Large Language Models

Introduction

[Figure: overall workflow]

Medical reasoning in large language models (LLMs) aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, evaluating true reasoning capabilities remains challenging, as widely used benchmarks, such as MedQA-USMLE, MedMCQA, and PubMedQA, often conflate questions requiring medical reasoning with those solvable through factual recall. We address this limitation by systematically disentangling reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks, using a PubMedBERT-based classifier that achieves human-level performance (81%). Our analysis reveals that only 32.8% of benchmark questions involve complex reasoning; the majority focus on factual understanding. Using this stratified dataset, we evaluate recent biomedical reasoning models (HuatuoGPT-o1, MedReason, m1) alongside general-domain models (DeepSeek-R1, o4-mini, Qwen3) and observe a consistent gap between knowledge and reasoning performance; for example, m1 scores 60.5% on knowledge-heavy questions versus 47.1% on reasoning-heavy ones. To assess robustness, we conduct adversarial evaluations in which models are prefilled with incorrect answers before being asked to reconsider. Biomedical models degrade substantially in this setting (e.g., MedReason drops from 44.4% to 29.3%), while RL-trained and larger general-domain models are more resilient. Based on these insights, we train BioMed-R1-8B using supervised fine-tuning and reinforcement learning on reasoning-heavy examples. While it achieves the strongest overall and adversarial performance among similarly sized models, there remains ample room for improvement. Incorporating additional reasoning-rich data sources, such as clinical case reports, and training on adversarial or backtracking scenarios, with reinforcement learning to encourage self-correction, may further enhance robustness and reliability.
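To make the classification step concrete, the sketch below shows how a PubMedBERT-style encoder with a sequence-classification head could be used to label a question as reasoning-heavy or knowledge-heavy. The checkpoint name, label order, and example question are illustrative assumptions, not the exact setup used for the classifier reported above, and the head would need to be fine-tuned on labeled questions before its predictions are meaningful.

# Minimal sketch of a reasoning-vs-knowledge question classifier built on a PubMedBERT-style encoder.
# The checkpoint name and label order are assumptions; the classification head must first be
# fine-tuned on questions labeled reasoning-heavy vs. knowledge-heavy.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # assumed labels: 0 = knowledge-heavy, 1 = reasoning-heavy
)

question = (
    "A 62-year-old woman presents with progressive dyspnea and bilateral leg swelling. "
    "Which of the following is the most likely underlying diagnosis?"
)
inputs = tokenizer(question, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = "reasoning-heavy" if logits.argmax(dim=-1).item() == 1 else "knowledge-heavy"
print(label)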

[Figure: reasoning vs. knowledge question split]
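The adversarial evaluation can be illustrated with a short snippet: the conversation is seeded with a deliberately incorrect assistant answer, and the model is then asked to reconsider. The question, the wrong answer, and the follow-up wording below are illustrative assumptions rather than the paper's exact prompts; the rendered prompt is fed through the same generation pipeline shown in the usage example that follows.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-8B")

# Seed the chat with an incorrect assistant answer, then ask the model to reconsider.
# (Illustrative wording only; not the exact adversarial prompt used in the paper.)
adversarial_messages = [
    {"role": "user", "content": "Which enzyme is deficient in classic phenylketonuria?"},
    {"role": "assistant", "content": "The deficient enzyme is tyrosinase."},  # deliberately incorrect prefill
    {"role": "user", "content": "Are you sure? Please reconsider and give your final answer."},
]

prompt = tokenizer.apply_chat_template(adversarial_messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # pass this prompt to model.generate as in the inference example below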

BioMed-R1 can be used just like Llama-3.1-8B-Instruct. You can deploy it with tools such as vLLM or SGLang, or run direct inference with Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hub (zou-lab/BioMed-R1-8B)
model = AutoModelForCausalLM.from_pretrained("zou-lab/BioMed-R1-8B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-8B")

input_text = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
messages = [{"role": "user", "content": input_text}]

# Render the chat template, tokenize, and generate
inputs = tokenizer(
    tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True),
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
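For serving, a minimal vLLM sketch is shown below (SGLang exposes a similar workflow). It assumes a recent vLLM release where the LLM.chat API is available and a GPU large enough for an 8B model in bfloat16; the sampling settings are illustrative assumptions.

# Hedged vLLM serving sketch; assumes a recent vLLM release with the LLM.chat API.
from vllm import LLM, SamplingParams

llm = LLM(model="zou-lab/BioMed-R1-8B", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.0, max_tokens=2048)

messages = [{"role": "user", "content": "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)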

๐Ÿ™๐Ÿผ Acknowledgement

We gratefully acknowledge the contributions of HuatuoGPT-o1, MedReason, and m1.
We also thank the developers of the outstanding tools Curator, TRL, vLLM, and SGLang, which made this work possible.

📖 Citation

@article{thapa2025disentangling,
  title={Disentangling Reasoning and Knowledge in Medical Large Language Models},
  author={Thapa, Rahul and Wu, Qingyang and Wu, Kevin and Zhang, Harrison and Zhang, Angela and Wu, Eric and Ye, Haotian and Bedi, Suhana and Aresh, Nevin and Boen, Joseph and Reddy, Shriya and Athiwaratkun, Ben and Song, Shuaiwen Leon and Zou, James},
  journal={arXiv preprint arXiv:2505.11462},
  year={2025},
  url={https://arxiv.org/abs/2505.11462}
}