UniVLG: Unifying 2D and 3D Vision-Language Understanding

This repository contains the UniVLG model, presented in the paper "Unifying 2D and 3D Vision-Language Understanding". UniVLG is a single architecture that handles both 2D and 3D vision-language understanding.

Project page: https://univlg.github.io

The model uses a custom loading tool (uvx). Checkpoints are available on Hugging Face. See the GitHub repository for code and instructions.
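
For readers who only want the checkpoint files on disk, the following is a minimal sketch using `snapshot_download` from the `huggingface_hub` package. It assumes the repository id is `katefgroup/UniVLG`, as shown on the model page; the function name and the `checkpoints/univlg` target directory are hypothetical choices for illustration. It only fetches files and does not load or run the model, which still requires the code in the GitHub repository.

```python
# Sketch only: downloads the published files; it does not load or run the model.
REPO_ID = "katefgroup/UniVLG"  # assumption: repo id as shown on the model page


def download_univlg_checkpoints(local_dir="checkpoints/univlg"):
    """Download all files from the model repo and return the local path.

    `local_dir` is a hypothetical default, not a path the project prescribes.
    """
    # Imported lazily so this file can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print(download_univlg_checkpoints())
```

Running the script requires `pip install huggingface_hub` and network access; the printed path is where the files were placed.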

Citation

@article{jain2025unifying,
  title={Unifying 2D and 3D Vision-Language Understanding},
  author={Jain, Ayush and Swerdlow, Alexander and Wang, Yuzhou and Arnaud, Sergio and Martin, Ada and Sax, Alexander and Meier, Franziska and Fragkiadaki, Katerina},
  journal={arXiv preprint arXiv:2503.10745},
  year={2025}
}

License Note: The majority of UniVLG is licensed under CC-BY-NC; however, portions of the project (specifically Odin and Pointcept) are available under separate MIT license terms. Please refer to the GitHub repository for details.