PaliGemma2 for Medical Image Parsing

This repository provides a PaliGemma2 model fine-tuned for comprehensive medical image question answering and analysis. The model is based on google/paligemma2-10b-pt-224 and was trained on the FLARE 2025 medical multimodal dataset, which includes 19 medical imaging datasets, 50,996 images, and 58,112 question-answer pairs across 8 imaging modalities.

Dataset Summary

  • Total datasets: 19
  • Total images: 50,996
  • Total questions: 58,112
  • Modalities: Clinical, Dermatology, Endoscopy, Mammography, Microscopy, Retinography, Ultrasound, Xray
  • Task types: Classification, Counting, Detection, Multi-label Classification, Regression, Report_Generation, Instance Detection

Training Details

  • Model: PaliGemma2-10B (LoRA fine-tuned, 4-bit quantization)
  • Training epochs: 8
  • Per-device batch size: 1
  • Gradient accumulation steps: 32
  • Effective batch size: 1 (per device) x 32 (accumulation steps) = 32
  • Optimizer: Paged AdamW 8-bit
  • Learning rate: 5e-5
  • Warmup ratio: 0.03
  • Max grad norm: 1.0
  • LoRA configuration: r=16, alpha=32, dropout=0.05
  • Target modules: q_proj, o_proj, k_proj, v_proj, gate_proj, up_proj, down_proj
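
The quantities derived from these settings can be checked with a few lines of Python. This is a minimal sketch, not the author's training script: the dictionary keys are illustrative names, the LoRA scaling factor assumes the standard alpha / r convention (not rank-stabilized LoRA), and the step counts assume a single device and that all 58,112 question-answer pairs are used for training, which this card does not state.

```python
import math

# Hyperparameters copied from the list above; key names are illustrative.
hparams = {
    "epochs": 8,
    "per_device_batch_size": 1,
    "grad_accum_steps": 32,
    "num_devices": 1,  # assumption: single-device training
    "warmup_ratio": 0.03,
    "lora_r": 16,
    "lora_alpha": 32,
}

# Assumption: every question-answer pair is a training example.
num_examples = 58_112

# Examples consumed per optimizer update.
effective_batch = (
    hparams["per_device_batch_size"]
    * hparams["grad_accum_steps"]
    * hparams["num_devices"]
)

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = hparams["lora_alpha"] / hparams["lora_r"]

# Rough optimizer-step budget under the assumptions above.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * hparams["epochs"]
warmup_steps = math.ceil(total_steps * hparams["warmup_ratio"])

print(effective_batch, lora_scale, steps_per_epoch, total_steps, warmup_steps)
```

With alpha set to twice the rank, the adapter update is scaled by 2, a common default pairing for LoRA.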

Model Performance

Task                         Metric                    Value      #Examples
classification               Balanced accuracy         0.4723     3513
multi-label classification   F1 score (micro)          0.5040     1446
detection                    F1 score (IoU > 0.5)      0.3446     255
instance_detection           F1 score (IoU > 0.5)      0.0028     176
counting                     Mean absolute error       295.6500   100
regression                   Mean absolute error       16.5035    100
report_generation            GREEN score               0.7072     1945
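
For readers unfamiliar with two of the metrics above, the sketch below shows one common way to compute balanced accuracy (the mean of per-class recall) and a detection F1 score at an IoU threshold of 0.5. It is an illustration, not the FLARE 2025 evaluation code: the `[x_min, y_min, x_max, y_max]` box format, the greedy one-to-one matching, and all function names are assumptions.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def iou(a, b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred_boxes, gt_boxes, threshold=0.5):
    """F1 score where each ground-truth box matches at most one prediction."""
    unmatched = list(gt_boxes)
    tp = 0
    for pred in pred_boxes:
        # Greedy matching: take the remaining ground truth with the highest IoU.
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) > threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy inputs, invented purely for illustration.
acc = balanced_accuracy(["a", "a", "a", "b"], ["a", "a", "b", "b"])
f1 = detection_f1(
    pred_boxes=[[0, 0, 10, 10], [50, 50, 60, 60]],
    gt_boxes=[[0, 0, 10, 8], [100, 100, 110, 110]],
)
print(acc, f1)
```

Balanced accuracy weights every class equally regardless of its frequency, which matters for medical datasets where positive findings are often rare.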

Usage

from transformers import (
    BitsAndBytesConfig,
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
)
from peft import PeftModel
from PIL import Image
import torch

base_model_id = "google/paligemma2-10b-pt-224"
model_id = "yws0322/flare25-paligemma2"

# Load the processor and the 4-bit quantized base model, then attach the LoRA adapter.
processor = PaliGemmaProcessor.from_pretrained(base_model_id)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, model_id)
model.eval()

image = Image.open("chest_xray.jpg").convert("RGB")
question = "What are the key findings in this chest X-ray?"
image_token = "<image>"
# Prompt layout: image placeholders, BOS token, then the instruction text.
prompt = f"{image_token * processor.image_seq_length}{processor.tokenizer.bos_token}Analyze the given medical image and answer the following question:\nQuestion: {question}\nPlease provide a clear and concise answer."
# Move the inputs to the same device as the model before generating.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = processor.tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
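
The prompt string in the example above has a fixed layout: the repeated image placeholder, then the BOS token, then the instruction wrapped around the question. A small helper makes that layout explicit and reusable across questions. This is a sketch only: `build_prompt` is a hypothetical name, and the defaults of 256 image tokens and `<bos>` are assumptions standing in for `processor.image_seq_length` and `processor.tokenizer.bos_token`, which should be passed in when the processor is available.

```python
IMAGE_TOKEN = "<image>"


def build_prompt(question, image_seq_length=256, bos_token="<bos>"):
    """Assemble the prompt in the same layout as the usage example.

    The default image_seq_length and bos_token are placeholder assumptions;
    prefer the values reported by the loaded processor.
    """
    instruction = (
        "Analyze the given medical image and answer the following question:\n"
        f"Question: {question}\n"
        "Please provide a clear and concise answer."
    )
    return f"{IMAGE_TOKEN * image_seq_length}{bos_token}{instruction}"


prompt = build_prompt("What are the key findings in this chest X-ray?")
# Print only the text after the image placeholders to keep the output short.
print(prompt[len(IMAGE_TOKEN) * 256:])
```

With a loaded processor, the call would look like `build_prompt(question, processor.image_seq_length, processor.tokenizer.bos_token)`.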

Citation

If you use this model in your research, please cite:

@misc{flare25paligemma2025,
  title={FLARE25-PaliGemma2},
  author={Yeonwoo Seo},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/yws0322/flare25-paligemma2}
}

@misc{paligemma2-base,
  title={PaliGemma2: Multimodal Vision-Language Model by Google Research},
  author={Google Research},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/google/paligemma2-10b-pt-224}
}

Model uploaded on 2025-06-03
