# QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training
This repository contains the model weights for QoQ-Med-VL-7B (Qwen Omni-Reasoning on Medical Questions), a multimodal clinical foundation model with reasoning capabilities.
## Model Weights

| Model | Weights | Avg. Val Accuracy |
|---|---|---|
| QoQ-Med-VL-7B | 🤗 HuggingFace | 68.6% |
| QoQ-Med-VL-32B | 🤗 HuggingFace | 70.7% |
## Quick Start

### Use with Front End Apps

Prefer a point-and-click experience? Community-maintained GGUF builds are already on the Hub. They load directly in desktop chat front ends such as LM Studio, Ollama, and other llama.cpp-compatible apps: search for "QoQ-Med-VL-7B" or "QoQ-Med-VL-32B", click Download, and start chatting. No Python environment, GPU, or command-line setup is required.
| Model | Format | HuggingFace Link |
|---|---|---|
| QoQ-Med-VL-7B | GGUF | mradermacher/QoQ-Med-VL-7B-GGUF |
| QoQ-Med-VL-7B-i1 | GGUF | mradermacher/QoQ-Med-VL-7B-i1-GGUF |
| QoQ-Med-VL-32B | GGUF | mradermacher/QoQ-Med-VL-32B-GGUF |
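If you do prefer a terminal, Ollama can also pull GGUF builds straight from the Hub. A minimal sketch, assuming the quantization tag used here (`Q4_K_M`) exists in the repository and that the build supports your use case:

```bash
# Pull and chat with a community GGUF build directly from Hugging Face
# (available quant tags and multimodal support depend on the specific GGUF release)
ollama run hf.co/mradermacher/QoQ-Med-VL-7B-GGUF:Q4_K_M
```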
### Installation

First, ensure you have the necessary dependencies:

```bash
pip install transformers qwen-vl-utils torch
```
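The examples below use `device_map="auto"`, which relies on the accelerate package; install it as well if it is not already in your environment:

```bash
pip install accelerate
```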
### Loading the Model

You can load the QoQ-Med model and processor with the transformers package:

```python
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the model with automatic dtype selection and device placement
model = AutoModelForVision2Seq.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("ddvd233/QoQ-Med-VL-7B")
```
For better performance, you can enable FlashAttention 2 and load the weights in bfloat16:

```python
import torch
from transformers import AutoModelForVision2Seq

# FlashAttention 2 with bfloat16 reduces memory use and speeds up inference
model = AutoModelForVision2Seq.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
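Note that `flash_attention_2` requires the flash-attn package and a compatible CUDA GPU; it is typically installed separately:

```bash
pip install flash-attn --no-build-isolation
```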
### Configuring Visual Token Range

You can adjust the visual token range to balance performance and computational cost:

```python
from transformers import AutoProcessor

# Bound the number of pixels (and therefore visual tokens) used per image
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "ddvd233/QoQ-Med-VL-7B",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
```
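Each 28×28 pixel patch corresponds to roughly one visual token (an assumption based on the Qwen-VL-style processor), so the defaults above allow roughly 256 to 1280 visual tokens per image; lowering `max_pixels` reduces memory usage at the cost of image detail.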
### Preparing Multimodal Input

Create a message with both image and text content:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/medical/image.jpg",
            },
            {"type": "text", "text": "Describe this medical image."},
        ],
    }
]
```
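The same message format also accepts video input. A minimal sketch, assuming the standard qwen_vl_utils convention of a `"type": "video"` entry pointing to a local file or URL (the path below is a placeholder):

```python
# Hypothetical video query; replace the placeholder path with a real clip
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/your/medical/clip.mp4"},
            {"type": "text", "text": "Describe the findings in this clip."},
        ],
    }
]
```

The processing step below works unchanged for such messages; `process_vision_info` returns the extracted frames through `video_inputs`.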
### Processing the Input

Prepare the inputs for model inference:

```python
from qwen_vl_utils import process_vision_info

# Render the chat template and extract the image/video tensors
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the prepared tensors to the GPU
inputs = inputs.to("cuda")
```
### Generating Output

Run inference and decode the output:

```python
# Generate a response, strip the prompt tokens, and decode the completion
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text[0])
```
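Since QoQ-Med is a reasoning model, its responses can run long; if outputs appear truncated, increase `max_new_tokens` (for example to 1024) so the model has room to finish its reasoning before the final answer.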
## Citations

If you find the project useful, please cite the following papers:

```bibtex
@inproceedings{dai2025climb,
  title={Climb: Data foundations for large scale multimodal clinical foundation models},
  author={Dai, Wei and Chen, Peilin and Lu, Malinda and Li, Daniel and Wei, Haowen and Cui, Hejie and Liang, Paul Pu},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

@article{dai2025qoq,
  title={QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training},
  author={Dai, Wei and Chen, Peilin and Ekbote, Chanakya and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2506.00711},
  year={2025}
}
```