VORTA: Efficient Video Diffusion via Routing Sparse Attention
📚 Paper | 💻 Code
NeurIPS '25 - VORTA accelerates video diffusion transformers via sparse attention and dynamic routing, achieving significant speedups with negligible quality loss.
Abstract
Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores; yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Code and weights are available at https://github.com/wenhao728/VORTA.
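To make the routing idea concrete, below is a minimal, illustrative sketch (not the released implementation), assuming a learned router scores a small set of attention variants and the highest-scoring variant replaces full 3D attention for a given layer. All names here (route_attention, windowed_attention, router_logits) are hypothetical.

import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    # Standard scaled dot-product attention over the full video sequence.
    return F.scaled_dot_product_attention(q, k, v)

def windowed_attention(q, k, v, window=256):
    # A simple local-window variant standing in for a specialized sparse pattern.
    out = torch.empty_like(q)
    for start in range(0, q.shape[-2], window):
        end = start + window
        out[..., start:end, :] = F.scaled_dot_product_attention(
            q[..., start:end, :], k[..., start:end, :], v[..., start:end, :])
    return out

def route_attention(q, k, v, router_logits):
    # Hard routing: pick the variant with the highest router score.
    variants = (full_attention, windowed_attention)
    return variants[int(torch.argmax(router_logits))](q, k, v)

# Toy usage: batch=1, heads=8, tokens=1024, head_dim=64.
q = k = v = torch.randn(1, 8, 1024, 64)
out = route_attention(q, k, v, router_logits=torch.tensor([0.2, 0.8]))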
Installation
Install PyTorch. We have tested the code with PyTorch 2.6.0 and CUDA 12.6, but it should work with other versions as well. You can install PyTorch using the following command:
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
Install the dependencies:
python -m pip install -r requirements.txt
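Optionally, verify that the installed build matches the tested configuration:

import torch

# Tested with PyTorch 2.6.0 + CUDA 12.6; other versions may also work.
print(torch.__version__)          # e.g. 2.6.0+cu126
print(torch.version.cuda)         # e.g. 12.6
print(torch.cuda.is_available())  # should be True for GPU inference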
Sample Usage (Inference)
We use general scripts to demonstrate the usage of our method. The detailed scripts for each model can be found in the scripts folder of the VORTA GitHub repository:
- HunyuanVideo: scripts/hunyuan/inference.sh
- Wan 2.1: scripts/wan/inference.sh
First, download the ready-to-use router weights by cloning this repository into a folder named VORTA:
git lfs install
git clone [email protected]:Wenhao-Sun/VORTA
# Move the router weights into results/ with: mv VORTA/<model_name> results/,
# where <model_name> is wan-14B or hunyuan. For example:
mv VORTA/wan-14B results/
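Alternatively, assuming the router weights are hosted as a standard Hugging Face model repository (as the clone URL above suggests), the huggingface_hub Python client can download them without git-lfs:

from huggingface_hub import snapshot_download

# Download the router weights into a local folder named VORTA,
# then move the <model_name> subfolder into results/ as shown above.
snapshot_download(repo_id="Wenhao-Sun/VORTA", local_dir="VORTA")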
Run the video DiTs with VORTA for acceleration (example for the wan model):
CUDA_VISIBLE_DEVICES=0 python scripts/wan/inference.py \
--pretrained_model_path Wan-AI/Wan2.1-T2V-14B-Diffusers \
--val_data_json_file prompt.json \
--output_dir results/wan-14B/vorta \
--resume_dir results/wan-14B/train \
--resume ckpt/step-000100 \
--enable_cpu_offload \
--seed 1234
For the hunyuan model, replace wan with hunyuan in the script path and output directory, and use hunyuanvideo-community/HunyuanVideo as the --pretrained_model_path.
You can edit the prompt JSON file (prompt.json in the example above) or change the --val_data_json_file option to use different text prompts. See the source code scripts/<model_name>/inference.py or run python scripts/<model_name>/inference.py --help for more detailed explanations of the arguments.
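For illustration only, such a prompt file could be generated programmatically; the field names below are assumptions, so verify the exact schema against the inference script:

import json

# Hypothetical prompt entries; the keys expected by inference.py may differ,
# so check scripts/<model_name>/inference.py for the actual format.
prompts = [
    {"prompt": "A corgi surfing a wave at sunset, cinematic lighting"},
    {"prompt": "Timelapse of a city skyline from day to night"},
]

with open("prompt.json", "w") as f:
    json.dump(prompts, f, indent=2)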
Acknowledgements
Thanks to the authors of the following repositories for their great work and for open-sourcing their code and models: Diffusers, HunyuanVideo, Wan 2.1, FastVideo
Citation
If you find our work helpful or inspiring, please feel free to cite it.
@article{wenhao_2025_vorta,
author = {Sun, Wenhao and Tu, Rong-Cheng and Ding, Yifu and Jin, Zhao and Liao, Jingyi and Liu, Shunyu and Tao, Dacheng},
title = {VORTA: Efficient Video Diffusion via Routing Sparse Attention},
journal = {arXiv preprint arXiv:2505.18809},
year = {2025}
}