---
language:
- en
base_model:
- THUDM/CogVideoX-5b
---
<div align="center">


<img src="icon.jpg" width="250"/>

<h2><center>Tora: Trajectory-oriented Diffusion Transformer for Video Generation</center></h2>

Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

\* equal contribution

<a href='https://arxiv.org/abs/2407.21705'><img src='https://img.shields.io/badge/ArXiv-2407.21705-red'></a>
<a href='https://github.com/alibaba/Tora/'><img src='https://img.shields.io/badge/GitHub-Link-Blue'></a> 
<a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖%20ModelScope-demo-blue'></a>


</div>

This is the official repository for the paper "Tora: Trajectory-oriented Diffusion Transformer for Video Generation".

## 💡 Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that follow the trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics across diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.
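
As a toy illustration only (hypothetical helper names, not the repository's actual API), the trajectory condition described above can be thought of as a sequence of per-frame (x, y) points. The sketch below, under that assumption, densifies a few user-drawn keypoints into one point per frame, which is the kind of input a Trajectory Extractor could then encode into motion patches.

```python
# Toy sketch only: hypothetical preprocessing, not Tora's actual interface.
# Assumes a trajectory is given as a few (x, y) keypoints that must be
# expanded to one point per generated frame before encoding.
import numpy as np

def densify_trajectory(keypoints, num_frames):
    """Linearly interpolate sparse (x, y) keypoints to one point per frame."""
    keypoints = np.asarray(keypoints, dtype=np.float32)   # shape: (K, 2)
    src = np.linspace(0.0, 1.0, num=len(keypoints))       # keypoint positions
    dst = np.linspace(0.0, 1.0, num=num_frames)           # one position per frame
    xs = np.interp(dst, src, keypoints[:, 0])
    ys = np.interp(dst, src, keypoints[:, 1])
    return np.stack([xs, ys], axis=1)                     # shape: (num_frames, 2)

# Example: a rough left-to-right arc densified to 49 frames.
trajectory = densify_trajectory([(60, 400), (240, 200), (420, 380)], num_frames=49)
print(trajectory.shape)  # (49, 2)
```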

## 📣 Updates

- `2024/10/23` 🔥🔥 Our [ModelScope Demo](https://www.modelscope.cn/studios/xiaoche/Tora) is launched. Welcome to try it out! We have also uploaded the model weights to [ModelScope](https://www.modelscope.cn/models/xiaoche/Tora).
- `2024/10/21` Thanks to [@kijai](https://github.com/kijai) for supporting Tora in ComfyUI! [Link](https://github.com/kijai/ComfyUI-CogVideoXWrapper)
- `2024/10/15` 🔥🔥 We released our inference code and model weights. **Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only. Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.**
- `2024/08/27` We released the v2 version of our paper, including the appendix.
- `2024/07/31` We submitted our paper on arXiv and released our project page.

## 📑 Table of Contents

- [Showcases](#%EF%B8%8F-showcases)
- [Model Weights](#-model-weights)
- [Inference](#-inference)
- [Acknowledgements](#-acknowledgements)
- [Our previous work](#-our-previous-work)
- [Citation](#-citation)

## 🎞️ Showcases

All showcase videos are available at this [link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip).





## 📦 Model Weights

### Download Links
Downloading these weights requires compliance with the [CogVideoX License](CogVideoX_LICENSE).

- SDK
```python
from modelscope import snapshot_download
model_dir = snapshot_download('xiaoche/Tora')
```

- Git 
```bash
git clone https://www.modelscope.cn/xiaoche/Tora.git
```
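
Since this model card is hosted on Hugging Face, the weights can likely also be fetched with `huggingface_hub`. The sketch below is an unverified alternative to the ModelScope commands above; the repo id is a placeholder to be filled in from the top of this model card.

```python
# Unverified alternative: download via huggingface_hub instead of ModelScope.
# Replace the placeholder repo_id with the id shown at the top of this model card.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="<this-model-card-repo-id>",  # placeholder, e.g. "<org>/Tora"
    local_dir="./Tora_weights",           # where to store the downloaded files
)
print(model_dir)
```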

## 🔄 Inference

Please refer to our [GitHub repository](https://github.com/alibaba/Tora) or the [ModelScope online demo](https://www.modelscope.cn/studios/xiaoche/Tora) for inference instructions.

### Recommendations for Text Prompts

For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.

You can refer to the following resources for guidance:

- [CogVideoX prompt conversion script](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
- [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
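
As a minimal, hypothetical example of the prompt-enhancement step recommended above (the helper name and system prompt are ours, not from the Tora or CogVideoX repositories), one could ask GPT-4 to expand a terse prompt before generation:

```python
# Hypothetical sketch: expand a short prompt with GPT-4 before running inference.
# Assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def enhance_prompt(short_prompt: str) -> str:
    """Ask GPT-4 to rewrite a terse idea as a detailed video-generation prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's idea as a single, richly detailed "
                        "video-generation prompt describing subject, motion, "
                        "camera, lighting, and background."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content.strip()

print(enhance_prompt("a sailboat drifting across a lake at sunset"))
```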



## 🤝 Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

- [CogVideo](https://github.com/THUDM/CogVideo): An open source video generation framework by THUKEG.
- [Open-Sora](https://github.com/hpcaitech/Open-Sora): An open source video generation framework by HPC-AI Tech.
- [MotionCtrl](https://github.com/TencentARC/MotionCtrl): A video generation model supporting motion control by ARC Lab, Tencent PCG.
- [ComfyUI-DragNUWA](https://github.com/chaojie/ComfyUI-DragNUWA): An implementation of DragNUWA for ComfyUI.

Special thanks to the contributors of these libraries for their hard work and dedication!

## 📄 Our previous work

- [AnimateAnything: Fine Grained Open Domain Image Animation with Motion Guidance](https://github.com/alibaba/animate-anything)

## 📚 Citation

```bibtex
@misc{zhang2024toratrajectoryorienteddiffusiontransformer,
      title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
      author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      year={2024},
      eprint={2407.21705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.21705},
}
```