VideoITG-8B

[🌐Homepage] [💻GitHub] [📜Tech Report] [🤗VideoITG-40K]

Introduction

VideoITG-8B is a multimodal video understanding model trained with instructed temporal grounding. It improves Video Large Language Models by selecting the frames that are most relevant to the user's instruction, so that frame sampling follows the query rather than a fixed schedule. This targets the difficulty of real-world videos, where the relevant content may occupy only a small part of the timeline. Please see our paper for more details.

Model Details

  • Model name: VideoITG-8B
  • Architecture: Customized Eagle-8B base model, fine-tuned with Instructed Temporal Grounding
  • Model type: Multimodal Large Language Model with Video Understanding
  • Languages: English (primary), with partial multilingual support

Model Performance

Model        Base Model        Frames  LongVideoBench  MLVU          VideoMME      CG-Bench
VideoITG-7B  InternVL2.5-8B    32      61.9 (+2.9%)    75.0 (+7.8%)  67.3 (+4.0%)  46.7 (+7.0%)
VideoITG-7B  InternVL2.5-26B   32      63.0 (+1.0%)    78.9 (+6.1%)  69.9 (+2.5%)  48.7 (+6.0%)
VideoITG-7B  LLaVA-Video-7B    32      61.6 (+3.6%)    74.6 (+8.6%)  66.1 (+3.0%)  42.8 (+9.0%)
VideoITG-7B  LLaVA-Video-7B    64      60.9 (+7.4%)    76.3 (+7.6%)  66.4 (+1.9%)  42.9 (+8.1%)

Key Features

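  • Instructed Temporal Grounding: Selects video frames according to the user's instruction
  • Plug-and-Play: Works as a frame selector for existing video language models, such as the base models listed under Model Performance
  • Superior Temporal Understanding: Improves scores on LongVideoBench, MLVU, VideoMME, and CG-Bench for every base model and frame budget reported under Model Performance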
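
The frame-selection idea behind these features can be illustrated with a minimal sketch. It assumes a selector has already assigned each candidate frame a relevance score for the instruction, keeps the highest-scoring frames within a frame budget, and returns them in temporal order for a downstream video language model. The function names, the scores, and the top-k rule are hypothetical and chosen for illustration only; they are not the VideoITG API and may differ from the selection procedure described in the paper. A uniform sampler is included as the instruction-agnostic baseline.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, budget: int) -> List[int]:
    """Baseline: pick `budget` frame indices spread evenly over the video."""
    if budget <= 0 or num_frames <= 0:
        return []
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the centre of each of the `budget` equal-length segments.
    return [int(step * i + step / 2) for i in range(budget)]


def select_frames(scores: Sequence[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring frames, returned in temporal order."""
    if budget <= 0:
        return []
    # Rank frame indices by relevance, highest first (ties keep earlier frames).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept indices so the downstream model sees frames in order.
    return sorted(ranked[:budget])


# Hypothetical per-frame relevance scores for one instruction (illustrative values).
scores = [0.05, 0.10, 0.90, 0.85, 0.20, 0.05, 0.70, 0.10]

print(uniform_sample(len(scores), 4))  # instruction-agnostic baseline
print(select_frames(scores, 4))        # instruction-aware selection
```

With the same budget, the uniform baseline ignores the scores entirely, while the score-based selection concentrates on the frames the hypothetical selector marked as relevant. Because the selected indices are simply a different subset of frames, the downstream model does not need to change, which is the sense in which such a selector is plug-and-play.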

License

Citation

If you find this project useful, please cite our work:

@article{wang2025videoitg,
  title     = {VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
  author    = {Shihao Wang and Guo Chen and De-An Huang and Zhiqi Li and Minghan Li and Guilin Liu and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
  journal   = {arXiv preprint arXiv:2507.13353},
  year      = {2025}
}

Acknowledgement

  • Eagle: The codebase we built upon
  • LMMs-Eval: Many thanks to the LMMs-Lab for the easy-to-use evaluation tools
  • LLaVA-OneVision and LLaVA-Video: We train our models with data from these great open-source projects

Model size: 7.52B params (Safetensors, tensor type BF16)