# PaliGemma2 for Medical Image Parsing

This repository provides a PaliGemma2 model fine-tuned for medical image question answering and analysis. The model is based on google/paligemma2-10b-pt-224 and was trained on the FLARE 2025 medical multimodal dataset, which comprises 19 medical imaging datasets, 50,996 images, and 58,112 question-answer pairs across 8 imaging modalities.
## Dataset Summary
- Total datasets: 19
- Total images: 50,996
- Total questions: 58,112
- Modalities: Clinical, Dermatology, Endoscopy, Mammography, Microscopy, Retinography, Ultrasound, X-ray
- Task types: Classification, Multi-label Classification, Detection, Instance Detection, Counting, Regression, Report Generation (a dataset-loading sketch follows this list)
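The dataset itself is hosted on the Hugging Face Hub (see Related Resources). As a minimal sketch, assuming the repository is loadable through the standard `datasets` API — its actual configuration, split names, and field names may differ, so check the dataset page first:

```python
from datasets import load_dataset

# Assumption: the FLARE Task 5 repo exposes a standard "train" split;
# inspect the dataset page for the real layout before relying on this.
ds = load_dataset("FLARE-MedFM/FLARE-Task5-MLLM-2D", split="train")
print(ds[0])  # one image / question / answer record
```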
## Training Details
- Model: PaliGemma2-10B (LoRA fine-tuned, 4-bit quantization)
- Training epochs: 8
- Per-device batch size: 1
- Gradient accumulation steps: 32
- Effective batch size: 1 (per device) × 32 (accumulation steps) = 32
- Optimizer: Paged AdamW 8-bit
- Learning rate: 5e-5
- Warmup ratio: 0.03
- Max grad norm: 1.0
- LoRA configuration: r=16, alpha=32, dropout=0.05
- Target modules: q_proj, o_proj, k_proj, v_proj, gate_proj, up_proj, down_proj (see the configuration sketch below)
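For reference, the hyperparameters above translate roughly into the following `peft`/`transformers` configuration objects. This is a sketch, not the original training script: the NF4 quant type, bf16 compute dtype, `task_type`, and output path are assumptions not stated in the card.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit quantization of the frozen base model. The card only says
# "4-bit quantization"; NF4 with bf16 compute is an assumed default.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA hyperparameters as listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "o_proj", "k_proj", "v_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Trainer hyperparameters as listed above.
training_args = TrainingArguments(
    output_dir="flare25-paligemma2",  # hypothetical output path
    num_train_epochs=8,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=5e-5,
    warmup_ratio=0.03,
    max_grad_norm=1.0,
    optim="paged_adamw_8bit",
    bf16=True,
)
```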
## Model Performance

| Task | Metric | Value | # Examples |
|---|---|---|---|
| Classification | Balanced accuracy | 0.4723 | 3513 |
| Multi-label classification | F1 score (micro) | 0.5040 | 1446 |
| Detection | F1 score (IoU > 0.5) | 0.3446 | 255 |
| Instance detection | F1 score (IoU > 0.5) | 0.0028 | 176 |
| Counting | Mean absolute error | 295.6500 | 100 |
| Regression | Mean absolute error | 16.5035 | 100 |
| Report generation | GREEN score | 0.7072 | 1945 |
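The evaluation harness behind these numbers is not part of this card. As a hedged sketch, the classification and error metrics correspond to standard implementations along these lines (toy labels for illustration only):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Classification: balanced accuracy (mean of per-class recalls).
y_true = ["pneumonia", "normal", "normal", "effusion"]
y_pred = ["pneumonia", "normal", "pneumonia", "effusion"]
print(balanced_accuracy_score(y_true, y_pred))

# Multi-label classification: micro-averaged F1 over binary indicator matrices.
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(Y_true, Y_pred, average="micro"))

# Counting / regression: mean absolute error.
true_vals = np.array([12.0, 7.0, 30.0])
pred_vals = np.array([10.0, 7.0, 25.0])
print(np.mean(np.abs(true_vals - pred_vals)))
```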
## Usage
```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration, PaliGemmaProcessor

base_model_id = "google/paligemma2-10b-pt-224"
model_id = "yws0322/flare25-paligemma2"

processor = PaliGemmaProcessor.from_pretrained(base_model_id)

# Load the 10B base model in 4-bit and attach the LoRA adapter.
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, model_id)

image = Image.open("chest_xray.jpg").convert("RGB")
question = "What are the key findings in this chest X-ray?"

# PaliGemma prompts start with one <image> token per image patch
# embedding, followed by BOS and the text prompt.
image_token = "<image>"
prompt = (
    f"{image_token * processor.image_seq_length}{processor.tokenizer.bos_token}"
    f"Analyze the given medical image and answer the following question:\n"
    f"Question: {question}\nPlease provide a clear and concise answer."
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
response = processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```
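For repeated inference, the LoRA adapter can optionally be merged into the base weights to drop the PEFT wrapper overhead. Merging into a 4-bit quantized model is unreliable, so this sketch reloads the base model in bf16 without quantization first (the output path is hypothetical):

```python
# Reload the base model in bf16 (no 4-bit quantization) so the adapter
# can be merged cleanly, then save a standalone checkpoint.
base_fp = PaliGemmaForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
merged = PeftModel.from_pretrained(base_fp, model_id).merge_and_unload()
merged.save_pretrained("flare25-paligemma2-merged")  # hypothetical local path
processor.save_pretrained("flare25-paligemma2-merged")
```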
## Related Resources

- FLARE 2025 Task 5 dataset: https://huggingface.co/datasets/FLARE-MedFM/FLARE-Task5-MLLM-2D
- PaliGemma documentation in Transformers: https://huggingface.co/docs/transformers/model_doc/paligemma
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{flare25paligemma2025,
  title     = {FLARE25-PaliGemma2},
  author    = {Yeonwoo Seo},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/yws0322/flare25-paligemma2}
}

@misc{paligemma2-base,
  title     = {PaliGemma2: Multimodal Vision-Language Model by Google Research},
  author    = {Google Research},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/google/paligemma2-10b-pt-224}
}
```
Model uploaded on 2025-06-03