---
license: apache-2.0
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: video-text-to-text
tags:
- multimodal large language model
- large video-language model
base_model:
- DAMO-NLP-SG/VideoLLaMA3-2B-Image
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/ZrZPYT0Q3wgza7Vc5BmyD.png" width="100%" style="margin-bottom: 0.2;"/>
</p>
<h3 align="center"><a href="https://arxiv.org/abs/2501.00599" style="color:#4D2B24">
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM</a></h3>
<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoRefer">GitHub</a> for the latest updates. </h5>
<div style="display: flex; justify-content: center; margin-top: 10px;">
<a href="https://arxiv.org/pdf/2501.00599"><img src="https://img.shields.io/badge/Arxiv-2501.00599-ECA8A7" style="margin-right: 5px;"></a>
<a href="https://huggingface.co/spaces/lixin4ever/VideoRefer-VideoLLaMA3"><img src='https://img.shields.io/badge/HuggingFace-Demo-96D03A' style="margin-right: 5px;"></a>
<a href="https://github.com/DAMO-NLP-SG/VideoRefer"><img src='https://img.shields.io/badge/Github-VideoRefer-F7C97E' style="margin-right: 5px;"></a>
<a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3"><img src='https://img.shields.io/badge/Github-VideoLLaMA3-9DC3E6' style="margin-right: 5px;"></a>
</div>
## 📰 News
* **[2025.6.19]** 🔥We release the [demo](https://huggingface.co/spaces/lixin4ever/VideoRefer-VideoLLaMA3) of VideoRefer-VideoLLaMA3, hosted on HuggingFace. Feel free to try it!
* **[2025.6.18]** 🔥We release a new version of VideoRefer ([VideoRefer-VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-VideoLLaMA3-7B) and [VideoRefer-VideoLLaMA3-2B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-VideoLLaMA3-2B)), trained on top of [VideoLLaMA3](https://github.com/DAMO-NLP-SG/VideoLLaMA3).
* **[2025.4.22]** 🔥Our VideoRefer-Bench has been adopted in [Describe Anything Model](https://arxiv.org/pdf/2504.16072) (NVIDIA & UC Berkeley).
* **[2025.2.27]** 🔥VideoRefer Suite has been accepted to CVPR 2025!
* **[2025.2.18]** 🔥We release the [VideoRefer-700K dataset](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K) on HuggingFace.
* **[2025.1.1]** 🔥We release [VideoRefer-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B), the code of VideoRefer and the [VideoRefer-Bench](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-Bench).
## 🌏 Model Zoo
| Model Name | Visual Encoder | Language Decoder |
|:----------------|:----------------|:------------------|
| [VideoRefer-VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-VideoLLaMA3-7B) | [VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| [VideoRefer-VideoLLaMA3-2B (This Checkpoint)](https://huggingface.co/DAMO-NLP-SG/VideoRefer-VideoLLaMA3-2B) | [VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
| [VideoRefer-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |
| [VideoRefer-7B-stage2](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |
| [VideoRefer-7B-stage2.5](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2.5) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |
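
Below is a minimal loading sketch for this checkpoint using the 🤗 Transformers `Auto*` classes; it assumes the repository ships custom modeling/processing code loadable with `trust_remote_code=True`. The exact prompt format and the object-mask inputs used for region-level referring are defined by the VideoRefer / VideoLLaMA3 codebases linked above, so treat this as a starting point rather than the official inference recipe.

```python
# Minimal loading sketch (assumption: custom code is bundled with the checkpoint
# and exposed via trust_remote_code=True; see the VideoRefer / VideoLLaMA3 repos
# for the full video + region-prompt inference pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "DAMO-NLP-SG/VideoRefer-VideoLLaMA3-2B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```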
## 📑 Citation
If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
```bibtex
@InProceedings{Yuan_2025_CVPR,
author = {Yuan, Yuqian and Zhang, Hang and Li, Wentong and Cheng, Zesen and Zhang, Boqiang and Li, Long and Li, Xin and Zhao, Deli and Zhang, Wenqiao and Zhuang, Yueting and Zhu, Jianke and Bing, Lidong},
title = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {18970-18980}
}
@article{damonlpsg2025videollama3,
title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Boqiang Zhang and Kehan Li and Zesen Cheng and Zhiqiang Hu and Yuqian Yuan and Guanzheng Chen and Sicong Leng and Yuming Jiang and Hang Zhang and Xin Li and Peng Jin and Wenqi Zhang and Fan Wang and Lidong Bing and Deli Zhao},
journal={arXiv preprint arXiv:2501.13106},
year={2025},
url = {https://arxiv.org/abs/2501.13106}
}
```