arxiv:2510.15042

Comprehensive language-image pre-training for 3D medical image understanding

Published on Oct 16

Authors:

Abstract

The Comprehensive Language-image Pre-training (COLIPRI) encoder family enhances 3D vision-language pre-training by combining image-only and paired image-text datasets, achieving superior performance in report generation, classification, and zero-shot classification.

AI-generated summary

Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.15042 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.15042 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.15042 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.