m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models

A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong medical reasoning within large language models.

⚡ Introduction

Hi! Welcome to the repository for m1 (📃 Paper)!

m1 is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:

  • Fine-tuning on a small, high-quality set of verified medical reasoning examples, showing that even with just 1K–23K examples, m1-7B surpasses previous SOTA models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B rivals 70B-scale models.

  • Scaling reasoning at inference using token budgets, which consistently improves performance across medical QA tasks up to an optimal budget of roughly 4K tokens; beyond that point, performance may degrade due to overthinking.

  • Identifying medical knowledge as the key bottleneck, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
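
The token-budget idea from the second point above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the inference code shipped in this repository: the helper names `cap_thinking` and `build_answer_prompt`, the whitespace-split "tokens", and the `Final answer:` marker are all hypothetical and chosen only to show how a reasoning trace might be cut at a fixed budget before the model is asked to answer.

```python
from typing import List, Tuple

# Hypothetical delimiter used to ask for the answer; not taken from the m1 code.
ANSWER_MARKER = "Final answer:"


def cap_thinking(reasoning_tokens: List[str], budget: int) -> Tuple[List[str], bool]:
    """Keep at most `budget` reasoning tokens.

    Returns the kept tokens and a flag that is True when the trace was cut.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    truncated = len(reasoning_tokens) > budget
    return reasoning_tokens[:budget], truncated


def build_answer_prompt(question: str, reasoning_tokens: List[str], budget: int) -> str:
    """Join the question, the budget-capped reasoning, and the answer marker."""
    kept, _ = cap_thinking(reasoning_tokens, budget)
    return f"{question}\n{' '.join(kept)}\n{ANSWER_MARKER}"


if __name__ == "__main__":
    # Toy trace split on whitespace; a real setup would count tokenizer tokens.
    trace = "fever and rash suggest a viral exanthem so consider measles".split()
    print(build_answer_prompt("What is the likely diagnosis?", trace, budget=6))
```

In practice the budget would be counted in tokenizer tokens rather than whitespace-separated words, and sweeping it over several values is one way to look for the point where longer reasoning stops helping.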

We have open-sourced our models, data, and code.


Updates:

  • 2025-03: We released our code, data, models, and paper!

🌍 Environment

Please refer to docs/ENV.md.

👨‍⚕️ Models and Data

| Model | Backbone | Training Data | Link |
|---|---|---|---|
| m1-32b-1k | Qwen2.5-32B-Instruct | m1k | HF Link |
| m1-7b-1k | Qwen2.5-7B-Instruct | m1k | HF Link |
| m1-7b-23k | Qwen2.5-7B-Instruct | m23k | HF Link |

🏃 Inference

(... same content as original README ...)

📖 Citation

@misc{huang2025m1UnleashPotential,
      title={m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models}, 
      author={Xiaoke Huang and Juncheng Wu and Hui Liu and Xianfeng Tang and Yuyin Zhou},
      year={2025},
      eprint={2504.00869},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.00869}, 
}