Disentangling Reasoning and Knowledge in Medical Large Language Models
Introduction

Medical reasoning in large language models (LLMs) aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, evaluating true reasoning capabilities remains challenging, as widely used benchmarks, such as MedQA-USMLE, MedMCQA, and PubMedQA, often conflate questions requiring medical reasoning with those solvable through factual recall. We address this limitation by systematically disentangling reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks using a PubMedBERT-based classifier that achieves human-level performance (81%). Our analysis reveals that only 32.8% of benchmark questions involve complex reasoning, with the majority focused on factual understanding. Using this stratified dataset, we evaluate recent biomedical reasoning models (HuatuoGPT-o1, MedReason, m1) alongside general-domain models (DeepSeek-R1, o4-mini, Qwen3) and observe a consistent performance gap between knowledge and reasoning; for example, m1 scores 60.5% vs. 47.1%, respectively. To assess robustness, we conduct adversarial evaluations where models are prefilled with incorrect answers before being asked to reconsider. Biomedical models show substantial degradation in this setting (e.g., MedReason drops from 44.4% to 29.3%), while RL-trained and larger general-domain models are more resilient. Based on these insights, we train BioMed-R1-8B using supervised fine-tuning and reinforcement learning on reasoning-heavy examples. While it achieves the strongest overall and adversarial performance among similarly sized models, there remains ample room for improvement. Incorporating additional reasoning-rich data sources, such as clinical case reports, and training on adversarial or backtracking scenarios, with reinforcement learning to encourage self-correction, may further enhance robustness and reliability.
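The reasoning/knowledge split described above is produced by a PubMedBERT-based question classifier. The snippet below is a minimal sketch of how such a classifier could be fine-tuned; the checkpoint name, the two toy examples, and all hyperparameters are illustrative assumptions rather than the exact setup from the paper.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

base = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed PubMedBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)  # 0 = knowledge, 1 = reasoning

# Toy placeholder data; the paper labels questions drawn from 11 biomedical QA benchmarks.
train_data = Dataset.from_dict({
    "text": [
        "Which enzyme converts angiotensin I to angiotensin II?",
        "A 54-year-old man presents with fever, a new murmur, and splinter hemorrhages. What is the most likely diagnosis?",
    ],
    "label": [0, 1],
})
train_data = train_data.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="reasoning-vs-knowledge-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5, report_to="none"),
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()

In practice the classifier is applied to every benchmark question to stratify evaluation sets into knowledge-heavy and reasoning-heavy subsets.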

BioMed-R1 can be used just like Llama-3.1-8B-Instruct. You can deploy it with tools like vLLM or SGLang, or perform direct inference:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("zou-lab/BioMed-R1-8B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("zou-lab/BioMed-R1-8B")

# Build a chat-formatted prompt from a single user question
input_text = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
messages = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate and decode the response
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
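The adversarial setting from the paper can be approximated by placing an incorrect answer in the assistant turn before asking the model to reconsider. The sketch below reuses the model and tokenizer loaded above; the prompt wording and the prefilled answer are illustrative assumptions, not the paper's verbatim protocol.

# Adversarial prefill sketch (assumed wording, not the paper's exact protocol)
question = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
wrong_answer = "The answer is no."  # stand-in for an answer known to be incorrect for the question

messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": wrong_answer},
    {"role": "user", "content": "Are you sure? Please reconsider and give your final answer."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

A robust model should recover the correct answer despite the misleading prefill; the paper reports that RL-trained and larger general-domain models degrade less in this setting.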
Acknowledgement
We gratefully acknowledge the contributions of HuatuoGPT-o1, MedReason, and M1.
We also thank the developers of the outstanding tools Curator, TRL, vLLM, and SGLang, which made this work possible.
Citation
@article{thapa2025disentangling,
  title={Disentangling Reasoning and Knowledge in Medical Large Language Models},
  author={Thapa, Rahul and Wu, Qingyang and Wu, Kevin and Zhang, Harrison and Zhang, Angela and Wu, Eric and Ye, Haotian and Bedi, Suhana and Aresh, Nevin and Boen, Joseph and Reddy, Shriya and Athiwaratkun, Ben and Song, Shuaiwen Leon and Zou, James},
  journal={arXiv preprint arXiv:2505.11462},
  year={2025},
  url={https://arxiv.org/abs/2505.11462}
}
Model tree for zou-lab/BioMed-R1-8B
Base model: meta-llama/Llama-3.1-8B