Abstract
UniVid combines video understanding and generation using an MLLM with a diffusion decoder, achieving state-of-the-art performance through Temperature Modality Alignment and Pyramid Reflection.
Unified video modeling that combines generation and understanding is increasingly important but faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, where text-visual token imbalance and uniform cross-modal attention applied along the flow trajectory degrade prompt adherence, and (2) efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: a 2.2% improvement in VBench-Long total score over EasyAnimateV5.1, and accuracy gains of 1.0% on MSVD-QA and 3.3% on ActivityNet-QA over the best prior 7B baselines.
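The two mechanisms named in the abstract can be made concrete with small sketches. First, a minimal sketch of a timestep-dependent cross-modal attention temperature, one plausible reading of Temperature Modality Alignment: sharpen attention onto the (few) text tokens early in the flow trajectory to counteract text-visual token imbalance. The function name, the linear schedule in the flow time `t`, and the `tau_min`/`tau_max` bounds are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: assumed timestep-dependent attention temperature.
import torch

def tma_cross_attention(q, k, v, t, tau_min=0.5, tau_max=1.0):
    """Cross-attention from visual queries to text keys/values.

    q: (B, Nq, D) visual tokens; k, v: (B, Nk, D) text tokens;
    t: flow time in [0, 1] (0 = pure noise, 1 = clean sample).
    A lower temperature early in the trajectory sharpens attention
    onto text tokens, mitigating text-visual token imbalance.
    """
    d = q.size(-1)
    tau = tau_min + (tau_max - tau_min) * t  # assumed linear schedule in t
    logits = q @ k.transpose(-2, -1) / (d ** 0.5 * tau)
    return torch.softmax(logits, dim=-1) @ v
```

Second, a hedged sketch of dynamic keyframe selection via coarse-to-fine temporal zoom-in, one plausible reading of Pyramid Reflection. The `score_frames` callback (e.g., per-frame relevance scores obtained by prompting the MLLM) and the window-halving rule are assumptions.

```python
# Sketch only: assumed coarse-to-fine keyframe selection.
def pyramid_reflection(num_frames, score_frames, levels=3, frames_per_level=8):
    """Return keyframe indices chosen by iterative temporal zoom-in.

    score_frames(indices) -> list of relevance scores, one per index.
    """
    lo, hi = 0, num_frames - 1
    for _ in range(levels):
        step = max(1, (hi - lo) // (frames_per_level - 1))
        indices = list(range(lo, hi + 1, step))[:frames_per_level]
        scores = score_frames(indices)
        # Zoom into a halved window around the highest-scoring frame.
        best = indices[max(range(len(indices)), key=lambda i: scores[i])]
        span = max(frames_per_level, (hi - lo) // 2)
        lo = max(0, best - span // 2)
        hi = min(num_frames - 1, best + span // 2)
    step = max(1, (hi - lo) // (frames_per_level - 1))
    return list(range(lo, hi + 1, step))[:frames_per_level]
```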
Community
TL;DR: We present UniVid, an open-source unified video model for both understanding and generation tasks. Our model requires only a small amount of high-quality data for fine-tuning, achieving competitive results across various tasks.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- AToken: A Unified Tokenizer for Vision (2025)
- SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation (2025)
- GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval (2025)
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding (2025)
- UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying (2025)
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer (2025)
- Reconstruction Alignment Improves Unified Multimodal Models (2025)