VORTA: Efficient Video Diffusion via Routing Sparse Attention
📚 Paper | 💻 Code
NeurIPS '25 - VORTA accelerates video diffusion transformers via sparse attention and dynamic routing, achieving significant speedups with negligible quality loss.
Abstract
Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores; yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Code and weights are available at https://github.com/wenhao728/VORTA.
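To make the routing idea concrete, below is a minimal, illustrative sketch (not the released implementation), assuming a learned router scores a small set of attention variants and the highest-scoring variant replaces full 3D attention for a given layer. All names here (route_attention, windowed_attention, router_logits) are hypothetical.

import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    # Standard scaled dot-product attention over the full video sequence.
    return F.scaled_dot_product_attention(q, k, v)

def windowed_attention(q, k, v, window=256):
    # A simple local-window variant standing in for a specialized sparse pattern.
    out = torch.empty_like(q)
    for start in range(0, q.shape[-2], window):
        end = start + window
        out[..., start:end, :] = F.scaled_dot_product_attention(
            q[..., start:end, :], k[..., start:end, :], v[..., start:end, :])
    return out

def route_attention(q, k, v, router_logits):
    # Hard routing: pick the variant with the highest router score.
    variants = (full_attention, windowed_attention)
    return variants[int(torch.argmax(router_logits))](q, k, v)

# Toy usage: batch=1, heads=8, tokens=1024, head_dim=64.
q = k = v = torch.randn(1, 8, 1024, 64)
out = route_attention(q, k, v, router_logits=torch.tensor([0.2, 0.8]))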
Installation
Install PyTorch. We have tested the code with PyTorch 2.6.0 and CUDA 12.6, but it should work with other versions as well. You can install PyTorch using the following command:
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
Install the dependencies:
python -m pip install -r requirements.txt
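Optionally, verify that the installed build matches the tested configuration:

import torch

# Tested with PyTorch 2.6.0 + CUDA 12.6; other versions may also work.
print(torch.__version__)          # e.g. 2.6.0+cu126
print(torch.version.cuda)         # e.g. 12.6
print(torch.cuda.is_available())  # should be True for GPU inference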
Sample Usage (Inference)
We use general scripts to demonstrate the usage of our method. The detailed scripts for each model can be found in the scripts folder of the VORTA GitHub repository:
- HunyuanVideo: scripts/hunyuan/inference.sh
- Wan 2.1: scripts/wan/inference.sh
First, download the ready-to-use router weights by cloning this repository into a folder named VORTA:
git lfs install
git clone [email protected]:Wenhao-Sun/VORTA
# Move the router weights into results/ with: mv VORTA/<model_name> results/,
# where <model_name> is wan-14B or hunyuan. For example:
mv VORTA/wan-14B results/
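Alternatively, assuming the router weights are hosted as a standard Hugging Face model repository (as the clone URL above suggests), the huggingface_hub Python client can download them without git-lfs:

from huggingface_hub import snapshot_download

# Download the router weights into a local folder named VORTA,
# then move the <model_name> subfolder into results/ as shown above.
snapshot_download(repo_id="Wenhao-Sun/VORTA", local_dir="VORTA")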
Run the video DiTs with VORTA for acceleration (example for the wan model):
CUDA_VISIBLE_DEVICES=0 python scripts/wan/inference.py \
--pretrained_model_path Wan-AI/Wan2.1-T2V-14B-Diffusers \
--val_data_json_file prompt.json \
--output_dir results/wan-14B/vorta \
--resume_dir results/wan-14B/train \
--resume ckpt/step-000100 \
--enable_cpu_offload \
--seed 1234
For the hunyuan model, replace wan with hunyuan in the script path and output directory, and use hunyuanvideo-community/HunyuanVideo as the --pretrained_model_path.
You can edit the prompt JSON file (prompt.json in the example above) or change the --val_data_json_file option to use different text prompts. See the source code scripts/<model_name>/inference.py or run python scripts/<model_name>/inference.py --help for more detailed explanations of the arguments.
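For illustration only, such a prompt file could be generated programmatically; the field names below are assumptions, so verify the exact schema against the inference script:

import json

# Hypothetical prompt entries; the keys expected by inference.py may differ,
# so check scripts/<model_name>/inference.py for the actual format.
prompts = [
    {"prompt": "A corgi surfing a wave at sunset, cinematic lighting"},
    {"prompt": "Timelapse of a city skyline from day to night"},
]

with open("prompt.json", "w") as f:
    json.dump(prompts, f, indent=2)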
Acknowledgements
Thanks to the authors of the following repositories for their great work and for open-sourcing their code and models: Diffusers, HunyuanVideo, Wan 2.1, FastVideo
Citation
If you find our work helpful or inspiring, please feel free to cite it.
@article{wenhao_2025_vorta,
author = {Sun, Wenhao and Tu, Rong-Cheng and Ding, Yifu and Jin, Zhao and Liao, Jingyi and Liu, Shunyu and Tao, Dacheng},
title = {VORTA: Efficient Video Diffusion via Routing Sparse Attention},
journal = {arXiv preprint arXiv:2505.18809},
year = {2025}
}