Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Yunheng Li · Yuxuan Li · Quansheng Zeng · Wenhai Wang · Qibin Hou† · Ming-Ming Cheng
Accepted by ICCV 2025!
[Paper] [Github] [Pretrained models]
Contributions
- 🔥 We identify the foreground-bias issue in existing VLMs and address it with region-text alignment that introduces explicit semantic structure through category guidance.
- 🔥 We propose DenseVLM, a region-language alignment framework that uses a strong VLM to retrieve categories for unlabeled regions and decouples foreground and background features to reduce bias.
- 🔥 Extensive experiments on dense prediction benchmarks show that our DenseVLM outperforms previous methods and exhibits promising scalability.
Overview
DenseVLM is an unsupervised fine-tuning framework for open-vocabulary dense prediction. It retrieves region-level semantics from a powerful vision-language model and decouples foreground and background features, yielding unbiased region-language alignment and improved open-vocabulary dense prediction.
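To make this concrete, here is a toy sketch of the kind of region-language alignment objective described above. All names (densevlm_style_alignment, teacher_logits, is_foreground) and the temperature are illustrative assumptions, not the released implementation: a frozen, stronger VLM retrieves a category for each unlabeled region, and foreground/background regions are aligned separately so background features are not pulled toward foreground categories.

import torch
import torch.nn.functional as F

def densevlm_style_alignment(region_feats, text_embeds, teacher_logits, is_foreground):
    """Toy sketch: retrieve a category per region from a frozen teacher VLM, then align
    student region features with category text embeddings, decoupling foreground and
    background regions."""
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Category retrieval: the strong (frozen) VLM labels each unlabeled region.
    pseudo_labels = teacher_logits.argmax(dim=-1)        # (num_regions,)
    # Student region-to-text similarities (the 0.07 temperature is illustrative).
    logits = region_feats @ text_embeds.t() / 0.07       # (num_regions, num_categories)
    # Decoupled losses: background regions are not pulled toward foreground categories.
    fg_loss = F.cross_entropy(logits[is_foreground], pseudo_labels[is_foreground])
    bg_loss = F.cross_entropy(logits[~is_foreground], pseudo_labels[~is_foreground])
    return fg_loss + bg_loss

# Toy usage with random features (8 regions, 20 candidate categories, 512-d embeddings).
region_feats = torch.randn(8, 512)
text_embeds = torch.randn(20, 512)
teacher_logits = torch.randn(8, 20)
is_foreground = torch.tensor([True, True, True, True, False, False, False, False])
print(densevlm_style_alignment(region_feats, text_embeds, teacher_logits, is_foreground))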
TODO
- Release the training and inference code of DenseVLM.
- Support training and inference code for RegionCLIP and CLIPSelf.
- Release the code to integrate DenseVLM into CAT-Seg.
- Release the code to integrate DenseVLM into F-ViT.
Quick Start
- 🚀 Linux system with CUDA 11.8
- 🚀 At least one RTX 3090 GPU (training defaults to 4 GPUs, roughly 23 min/epoch); a quick check is sketched below
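A quick sanity check of the GPU setup (assumes PyTorch is already installed; nothing DenseVLM-specific):

import torch

# Confirm CUDA is visible and enough GPUs are present for the default 4-GPU training setup.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 3090"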
1. Create Conda Environment
- The provided environment is recommended for reproducing our results; similar configurations may also work.
git clone [email protected]:HVision-NKU/DenseVLM.git
cd DenseVLM
conda create -n DenseVLM python=3.8.20
conda activate DenseVLM
pip install -r requirements.txt
pip install -e . -v
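After installation, a quick import check (illustrative; the exact versions are whatever requirements.txt pins) confirms the environment is usable:

# Verify that the core dependencies can be imported after `pip install -e .`.
import torch
import open_clip

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("open_clip:", open_clip.__version__)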
2. Data Preparation
The main experiments use images from the COCO and ADE20K datasets. Please prepare the datasets and organize them as follows:
DenseVLM/
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   └── panoptic_val2017.json
│   │   ├── panoptic_val2017/
│   │   ├── train2017/
│   │   ├── val2017/
│   │   ├── coco_pseudo_4764.json
│   │   └── coco_proposals.json
│   └── ADEChallengeData2016
│       ├── ade20k_panoptic_val/
│       ├── images/validation/
│       └── ade20k_panoptic_val.json
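Before training, a small path check like the following (paths copied from the tree above; purely optional) helps catch a misplaced file early:

import os

# Check that the expected dataset files/folders exist under data/.
expected = [
    "data/coco/annotations/instances_train2017.json",
    "data/coco/annotations/panoptic_val2017.json",
    "data/coco/panoptic_val2017",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco/coco_pseudo_4764.json",
    "data/coco/coco_proposals.json",
    "data/ADEChallengeData2016/ade20k_panoptic_val",
    "data/ADEChallengeData2016/images/validation",
    "data/ADEChallengeData2016/ade20k_panoptic_val.json",
]
for path in expected:
    print("OK  " if os.path.exists(path) else "MISSING", path)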
3. Checkpoints
Please download the pretrained weights from Hugging Face and organize them as follows:
DenseVLM/
├── checkpoints
│   ├── EVA02_CLIP_B_psz16_s8B.pt
│   ├── clipself_coco_6_save6_512_eva_vitl14_24layers.pt
│   └── densevlm_coco_6_save6_512_eva_vib16_12layers.pt
If you have a fine-tuned CLIP checkpoint, you can load it directly with open_clip. For example:
import open_clip

model = open_clip.create_model(
    'EVA02-CLIP-B-16', pretrained='eva',
    cache_dir='checkpoints/densevlm_coco_6_save6_512_eva_vib16_12layers.pt'
)
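A minimal usage sketch follows, assuming the standard open_clip API (get_tokenizer, image_transform, encode_image/encode_text); the image path, prompts, and the 224-pixel input size are placeholders or assumptions:

import torch
from PIL import Image
import open_clip

# Load the fine-tuned CLIP as above, plus the matching tokenizer and an eval transform.
model = open_clip.create_model(
    'EVA02-CLIP-B-16', pretrained='eva',
    cache_dir='checkpoints/densevlm_coco_6_save6_512_eva_vib16_12layers.pt'
)
tokenizer = open_clip.get_tokenizer('EVA02-CLIP-B-16')
preprocess = open_clip.image_transform(224, is_train=False)   # 224 assumed for the B/16 model

image = preprocess(Image.open('example.jpg')).unsqueeze(0)    # placeholder image path
text = tokenizer(['a photo of a cat', 'a photo of a dog'])    # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(similarity)   # probabilities over the candidate prompts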
4. Training and Testing
To fine-tune the CLIP model with DenseVLM, run:
bash scripts/train_densevlm_coco_image_patches_eva_vitb16.sh
To evaluate the CLIP model fine-tuned with DenseVLM, run:
bash scripts/test_coco_eva_vitb16_macc_boxes_masks.sh path/to/checkpoint.pt 2 densevlm_coco_test
bash scripts/test_ade_eva_vitb16_macc_boxes_masks.sh path/to/checkpoint.pt 2 densevlm_ade_test
🙏 Citation:
If you find this project useful, please consider citing:
@article{li2024densevlm,
title={Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction},
author={Li, Yunheng and Li, Yuxuan and Zeng, Quansheng and Wang, Wenhai and Hou, Qibin and Cheng, Ming-Ming},
journal={arXiv preprint arXiv:2412.06244},
year={2024}
}
@InProceedings{li2024cascadeclip,
title={Cascade-{CLIP}: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation},
author={Li, Yunheng and Li, Zhong-Yu and Zeng, Quan-Sheng and Hou, Qibin and Cheng, Ming-Ming},
booktitle={Proceedings of the 41st International Conference on Machine Learning},
pages={28243--28258},
year={2024},
volume={235},
month={21--27 Jul},
publisher={PMLR}
}
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License for non-commercial use only. Any commercial use requires formal permission in advance.