m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Abstract
Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
Community
m1 is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:
- Fine-tuning on a small, high-quality set of verified medical reasoning examples, showing that even with just 1K–23K examples, m1-7B surpasses previous SOTA models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B rivals 70B-scale models.
- Scaling reasoning at inference using token budgets, which consistently improves performance across medical QA tasks up to an optimal ~4K token budget, beyond which performance may degrade due to overthinking (see the sketch below this list).
- Identifying medical knowledge as the key bottleneck, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
We open-sourced our models, data, and code at https://github.com/UCSC-VLAA/m1 🤗🤗🤗
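For illustration, here is a minimal sketch of what inference-time scaling with budget forcing could look like using the Hugging Face `transformers` generation API: the model reasons under a fixed token budget (~4K), and a forcing phrase is appended to trigger an extra round of self-checking. The checkpoint id, forcing phrase, and prompt handling below are assumptions made for this sketch, not the exact m1 pipeline; see the repository above for the actual scripts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "UCSC-VLAA/m1-7B-23K"   # assumed checkpoint id; substitute the released one
THINKING_BUDGET = 4096               # ~4K-token budget, the sweet spot reported above
FORCE_PHRASE = "Wait, let me double-check the relevant medical facts."  # illustrative forcing prompt

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer_with_budget(question: str, forced_rounds: int = 1) -> str:
    """Generate up to THINKING_BUDGET reasoning tokens, then append a forcing
    phrase so the model re-checks its own chain of thought (budget forcing)."""
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    for _ in range(forced_rounds + 1):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(
            **inputs, max_new_tokens=THINKING_BUDGET, do_sample=False
        )
        completion = tokenizer.decode(
            output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        # Feed the model its own reasoning plus the forcing phrase, so the next
        # round continues and re-checks the same chain of thought.
        prompt = prompt + completion + "\n" + FORCE_PHRASE + "\n"
    return prompt

print(answer_with_budget(
    "A patient on long-term amiodarone develops new exertional dyspnea. "
    "Which adverse effect should be ruled out first?"
))
```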
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (2025)
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? (2025)
- Theorem Prover as a Judge for Synthetic Data Generation (2025)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2025)
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task (2025)
- MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning (2025)
- InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models (2025)