# QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training
This repository contains the model weights for QoQ-Med-VL-7B (Qwen Omni-Reasoning on Medical Questions), a multimodal clinical foundation model with reasoning capabilities.
## Model Weights

| Model | Weights | Avg. Val Accuracy |
|---|---|---|
| QoQ-Med-VL-7B | 🤗 HuggingFace | 68.6% |
| QoQ-Med-VL-32B | 🤗 HuggingFace | 70.7% |
## Quick Start

### Use with Front End Apps

Prefer a point-and-click experience? Community-maintained GGUF builds are already on the Hub. They load directly in desktop chat front ends such as LM Studio, Ollama, and other llama.cpp-compatible apps: search for "QoQ-Med-VL-7B" or "QoQ-Med-VL-32B", click Download, and start chatting. No Python environment, GPU, or command-line setup is required.
| Model | Format | HuggingFace Link |
|---|---|---|
| QoQ-Med-VL-7B | GGUF | mradermacher/QoQ-Med-VL-7B-GGUF |
| QoQ-Med-VL-7B-i1 | GGUF | mradermacher/QoQ-Med-VL-7B-i1-GGUF |
| QoQ-Med-VL-32B | GGUF | mradermacher/QoQ-Med-VL-32B-GGUF |
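If you do prefer a terminal, Ollama can also pull GGUF builds straight from the Hub. A minimal sketch, assuming the quantization tag used here (`Q4_K_M`) exists in the repository and that the build supports your use case:

```bash
# Pull and chat with a community GGUF build directly from Hugging Face
# (available quant tags and multimodal support depend on the specific GGUF release)
ollama run hf.co/mradermacher/QoQ-Med-VL-7B-GGUF:Q4_K_M
```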
### Installation

First, ensure you have the necessary dependencies:

```bash
pip install transformers qwen-vl-utils torch
```
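The examples below use `device_map="auto"`, which relies on the accelerate package; install it as well if it is not already in your environment:

```bash
pip install accelerate
```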
### Loading the Model

You can load the QoQ-Med model and processor with the transformers package:

```python
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the model with automatic dtype selection and device placement
model = AutoModelForVision2Seq.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("ddvd233/QoQ-Med-VL-7B")
```
For better performance, you can enable FlashAttention 2 and load the weights in bfloat16:

```python
import torch
from transformers import AutoModelForVision2Seq

# FlashAttention 2 with bfloat16 reduces memory use and speeds up inference
model = AutoModelForVision2Seq.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
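Note that `flash_attention_2` requires the flash-attn package and a compatible CUDA GPU; it is typically installed separately:

```bash
pip install flash-attn --no-build-isolation
```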
### Configuring Visual Token Range

You can adjust the visual token range to balance performance and computational cost:

```python
from transformers import AutoProcessor

# Bound the number of pixels (and therefore visual tokens) used per image
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
```
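Each 28×28 pixel patch corresponds to roughly one visual token (an assumption based on the Qwen-VL-style processor), so the defaults above allow roughly 256 to 1280 visual tokens per image; lowering `max_pixels` reduces memory usage at the cost of image detail.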
### Preparing Multimodal Input

Create a message with both image and text content:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/medical/image.jpg",
            },
            {"type": "text", "text": "Describe this medical image."},
        ],
    }
]
```
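The same message format also accepts video input. A minimal sketch, assuming the standard qwen_vl_utils convention of a `"type": "video"` entry pointing to a local file or URL (the path below is a placeholder):

```python
# Hypothetical video query; replace the placeholder path with a real clip
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/your/medical/clip.mp4"},
            {"type": "text", "text": "Describe the findings in this clip."},
        ],
    }
]
```

The processing step below works unchanged for such messages; `process_vision_info` returns the extracted frames through `video_inputs`.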
### Processing the Input

Prepare the inputs for model inference:

```python
from qwen_vl_utils import process_vision_info

# Render the chat template and extract the image/video tensors
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the prepared tensors to the GPU
inputs = inputs.to("cuda")
```
### Generating Output

Run inference and decode the output:

```python
# Generate a response, strip the prompt tokens, and decode the completion
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text[0])
```
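Since QoQ-Med is a reasoning model, its responses can run long; if outputs appear truncated, increase `max_new_tokens` (for example to 1024) so the model has room to finish its reasoning before the final answer.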
## Citations

If you find the project useful, please cite the following papers:

```bibtex
@inproceedings{dai2025climb,
  title={Climb: Data foundations for large scale multimodal clinical foundation models},
  author={Dai, Wei and Chen, Peilin and Lu, Malinda and Li, Daniel and Wei, Haowen and Cui, Hejie and Liang, Paul Pu},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

@article{dai2025qoq,
  title={QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training},
  author={Dai, Wei and Chen, Peilin and Ekbote, Chanakya and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2506.00711},
  year={2025}
}
```