<div align="center">
<h1>
  <b>m1</b>: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
</h1>
<p>
A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong medical reasoning within large language models.
</p>
</div>

This repository contains the model presented in the paper [m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models](https://huggingface.co/papers/2504.00869).

Code: https://github.com/UCSC-VLAA/m1

## ⚡ Introduction

Hi! Welcome to the Hugging Face repository for m1!

**m1** is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time “thinking.” Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:

- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.

- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks up to an optimal budget of roughly 4K tokens; beyond that, performance may degrade due to overthinking.

- **Identifying medical knowledge as the key bottleneck**, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.
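
To make the token-budget idea concrete, here is a minimal sketch of one common way to cap inference-time "thinking": let the model reason until it either closes its reasoning itself or exhausts the budget, then force the reasoning closed and decode the answer. This is an illustration under stated assumptions, not the m1 implementation. The function `generate_with_budget`, the `</think>` and `<eos>` markers, and the `toy_model` stand-in are all hypothetical names chosen for this example; see the code repository for the actual inference scripts.

```python
from typing import Callable, List


def generate_with_budget(
    next_token: Callable[[List[str]], str],  # hypothetical: maps context -> next token
    prompt: List[str],
    thinking_budget: int,
    answer_budget: int = 64,
    end_think: str = "</think>",  # assumed end-of-reasoning marker
    eos: str = "<eos>",           # assumed end-of-sequence marker
) -> List[str]:
    """Decode with a hard cap on the number of reasoning tokens."""
    tokens = list(prompt)

    # Phase 1: reasoning. Stop when the model closes its reasoning
    # or when the budget is spent, whichever comes first.
    used = 0
    while used < thinking_budget:
        tok = next_token(tokens)
        tokens.append(tok)
        used += 1
        if tok == end_think:
            break
    else:
        # Budget exhausted without a natural close: force the close marker.
        tokens.append(end_think)

    # Phase 2: answer. Decode until end-of-sequence or the answer cap.
    for _ in range(answer_budget):
        tok = next_token(tokens)
        if tok == eos:
            break
        tokens.append(tok)
    return tokens


def toy_model(context: List[str]) -> str:
    """Stand-in for an LLM: 'thinks' for 10 steps, then answers 'B'."""
    if "</think>" not in context:
        return "step" if context.count("step") < 10 else "</think>"
    return "<eos>" if context[-1] == "B" else "B"


short = generate_with_budget(toy_model, ["Q"], thinking_budget=3)
full = generate_with_budget(toy_model, ["Q"], thinking_budget=20)
print(short)
print(full.count("step"))
```

With a budget of 3 the toy model is cut off early and forced to answer; with a budget of 20 it finishes its 10 reasoning steps on its own, so the extra budget goes unused. Sweeping `thinking_budget` on a real model is how one would look for the point at which accuracy stops improving.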

