Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models

This is a checkpoint of Stable Diffusion fine-tuned on the MIMIC-CXR dataset with CXR-BERT as its text encoder, as presented in the paper Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models.

This model introduces a novel approach to phrase grounding in medical imaging, demonstrating that generative text-to-image diffusion models, specifically fine-tuned Stable Diffusion, can achieve superior zero-shot performance compared to traditional discriminative methods. Key innovations include:

  • Leveraging cross-attention maps from generative diffusion models for phrase grounding.
  • Fine-tuning diffusion models with a frozen, domain-specific language model (CXR-BERT) to significantly improve performance in medical contexts.
  • Introducing Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to refine cross-attention maps and enhance localization accuracy.

The model aims to map natural language phrases from clinical reports to specific image regions, facilitating disease localization. Training was performed for 30,000 steps using eight A100 GPUs.
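For intuition, the grounding signal comes from aggregating the UNet's cross-attention maps for the phrase token into a spatial heatmap. The sketch below is a minimal, generic illustration of that aggregation (head/layer averaging and nearest-neighbour upsampling are assumptions; the paper's exact aggregation and the BBM post-processing may differ):

```python
import numpy as np

def grounding_heatmap(attn_maps, token_idx, out_size=64):
    """Aggregate cross-attention maps into a phrase-grounding heatmap.

    attn_maps: list of arrays, one per UNet layer, each shaped
               (heads, H*W, num_text_tokens).
    token_idx: index of the phrase token in the text encoding.
    out_size:  side length of the common output resolution
               (must be a multiple of each layer's spatial side).
    """
    layer_maps = []
    for a in attn_maps:
        side = int(np.sqrt(a.shape[1]))
        # average over attention heads, then select the phrase token
        m = a.mean(axis=0)[:, token_idx].reshape(side, side)
        # nearest-neighbour upsample to a common resolution (assumption)
        scale = out_size // side
        layer_maps.append(np.kron(m, np.ones((scale, scale))))
    heat = np.mean(layer_maps, axis=0)
    # min-max normalize to [0, 1] for visualization/thresholding
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```

In practice the per-layer maps would be captured with hooks on the UNet's cross-attention modules during a denoising pass conditioned on the clinical phrase.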

Base Models and Datasets

  • Base model: Stable Diffusion (fine-tuned)
  • Text encoder: CXR-BERT (frozen, domain-specific)
  • Training data: MIMIC-CXR

Usage and Reproduction

To reproduce the results of the "Generate to Ground" paper, including environment setup, data preparation, and execution of evaluation scripts (with and without Bimodal Bias Merging), please refer to the official GitHub repository. The repository provides comprehensive instructions and the corresponding code:

https://github.com/Felix-012/generate_to_ground/
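For orientation, phrase-grounding heatmaps are commonly scored against ground-truth annotations with the pointing game: a prediction counts as a hit if the heatmap's peak falls inside the annotated region. This is a generic evaluation sketch, not necessarily the repository's exact protocol:

```python
import numpy as np

def pointing_game_hit(heatmap, box):
    """Return True if the heatmap's peak lies inside the ground-truth box.

    heatmap: 2-D array of grounding scores.
    box:     (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)
```

Accuracy is then the fraction of phrase-box pairs for which the peak is a hit.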

Citation

If you find this work helpful or inspiring, please consider citing the original paper:

@inproceedings{nutzel2025generate,
  title={Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models},
  author={Felix N{\"u}tzel and Mischa Dombrowski and Bernhard Kainz},
  booktitle={Medical Imaging with Deep Learning},
  year={2025},
  url={https://openreview.net/forum?id=yTjotBI30L}
}

Acknowledgement

(Some) HPC resources were provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR projects b143dc and b180dc. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.
