Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation (ICCV 2025)

Luca Barsellotti* Lorenzo Bianchi* Nicola Messina Fabio Carrara Marcella Cornia Lorenzo Baraldi Fabrizio Falchi Rita Cucchiara

Installation

# Create a new environment with Python 3.10
conda create --name talk2dino python=3.10 -c conda-forge
conda activate talk2dino

# Install compilers for C++/CUDA extensions
conda install -c conda-forge "gxx_linux-64=11.*" "gcc_linux-64=11.*"

# Install CUDA toolkit and cuDNN
conda install -c nvidia/label/cuda-11.7.0 cuda 
conda install -c nvidia/label/cuda-11.7.0 cuda-nvcc
conda install -c conda-forge cudnn cudatoolkit=11.7.0

# Install PyTorch 2.1 built against CUDA 11.8 (cu118 wheels)
# Note: this exact PyTorch/CUDA combination is crucial, as it matches the
# mmcv-full 1.7.2 wheel index (cu118/torch2.1.0) used below
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# Install other dependencies
pip install -r requirements.txt
pip install -U openmim
mim install mmengine

# Install a compatible version of mmcv-full (1.7.2) for PyTorch 2.1
pip install mmcv-full==1.7.2 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1.0/index.html

# Install mmsegmentation
pip install mmsegmentation==0.30.0
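
Because the steps above pin several interdependent versions, it can help to confirm that the environment actually ended up with them. The following is a minimal sketch and not part of the repository: the helper name `check_versions` and the `EXPECTED` table are illustrative assumptions, with the version numbers copied from the commands above. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Version pins copied from the installation commands above.
# EXPECTED and check_versions are hypothetical names used for this sketch.
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "mmcv-full": "1.7.2",
    "mmsegmentation": "0.30.0",
}


def check_versions(expected):
    """Return {package: (status, installed_version_or_None)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = ("missing", None)
            continue
        # Wheels from the PyTorch index carry a local tag such as "+cu118";
        # strip it before comparing against the pinned version.
        base = found.split("+")[0]
        report[name] = ("ok" if base == wanted else "mismatch", found)
    return report


if __name__ == "__main__":
    for name, (status, found) in check_versions(EXPECTED).items():
        print(f"{name:<16} expected {EXPECTED[name]:<8} found {found} [{status}]")
```

Run the script inside the activated `talk2dino` environment. Any `missing` or `mismatch` entry suggests re-running the corresponding install step. To also confirm that the GPU is visible to PyTorch, `torch.cuda.is_available()` should return `True`.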

Qualitative Results

[Figure: qualitative comparison with columns Image | Ground Truth | FreeDA | ProxyCLIP | CLIP-DINOiser | Ours (Talk2DINO)]

Reference

If you find this code useful, please cite the following paper:

@misc{barsellotti2024talkingdinobridgingselfsupervised,
      title={Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation}, 
      author={Luca Barsellotti and Lorenzo Bianchi and Nicola Messina and Fabio Carrara and Marcella Cornia and Lorenzo Baraldi and Fabrizio Falchi and Rita Cucchiara},
      year={2024},
      eprint={2411.19331},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19331}, 
}