---
language:
- en
base_model:
- THUDM/CogVideoX-5b
---
<div align="center">
<img src="icon.jpg" width="250"/>
<h2><center>Tora: Trajectory-oriented Diffusion Transformer for Video Generation</center></h2>
Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang
\* equal contribution
<a href='https://arxiv.org/abs/2407.21705'><img src='https://img.shields.io/badge/ArXiv-2407.21705-red'></a>
<a href='https://github.com/alibaba/Tora/'><img src='https://img.shields.io/badge/GitHub-Link-Blue'></a>
<a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖%20ModelScope-demo-blue'></a>
</div>
This is the official repository for the paper "Tora: Trajectory-oriented Diffusion Transformer for Video Generation".
## 💡 Abstract
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that follow the trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics across diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.
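The abstract above describes a three-part conditioning path: a trajectory is encoded by the TE into spacetime motion patches, which the MGF then injects into the DiT blocks. The following is a conceptual sketch of that data flow only; it is not the released implementation, and every module name, layer choice, tensor shape, and the additive fusion step are assumptions made purely for illustration.

```python
# Conceptual sketch of Tora's conditioning path (illustration only; the real
# TE/MGF architectures are described in the paper and differ from this toy code).
import torch
import torch.nn as nn


class TrajectoryExtractor(nn.Module):
    """Encodes a dense trajectory map into spacetime motion patches (assumed 3D conv stack)."""

    def __init__(self, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(2, hidden_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden_dim, hidden_dim, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, traj_map):        # traj_map: (B, 2, T, H, W) displacement-style map
        return self.encoder(traj_map)   # motion patches on a downsampled spacetime grid


class MotionGuidanceFuser(nn.Module):
    """Injects motion patches into a DiT block's hidden states (additive fusion assumed)."""

    def __init__(self, dim, motion_dim=128):
        super().__init__()
        self.proj = nn.Linear(motion_dim, dim)

    def forward(self, hidden_states, motion_tokens):
        # hidden_states: (B, N, dim); motion_tokens: (B, N, motion_dim)
        return hidden_states + self.proj(motion_tokens)


# Toy shapes, for illustration only.
te = TrajectoryExtractor()
mgf = MotionGuidanceFuser(dim=1024)
traj_map = torch.zeros(1, 2, 16, 64, 64)           # trajectory rendered as a dense map
motion = te(traj_map)                               # (1, 128, 8, 16, 16)
motion_tokens = motion.flatten(2).transpose(1, 2)   # (1, N, 128), aligned with DiT tokens
hidden = torch.zeros(1, motion_tokens.shape[1], 1024)
fused = mgf(hidden, motion_tokens)                  # hidden states carrying motion guidance
```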
## 📣 Updates
- `2024/10/23` 🔥🔥 Our [ModelScope Demo](https://www.modelscope.cn/studios/xiaoche/Tora) has launched. Welcome to try it out! We have also uploaded the model weights to [ModelScope](https://www.modelscope.cn/models/xiaoche/Tora).
- `2024/10/21` Thanks to [@kijai](https://github.com/kijai) for supporting Tora in ComfyUI! [Link](https://github.com/kijai/ComfyUI-CogVideoXWrapper)
- `2024/10/15` 🔥🔥 We released our inference code and model weights. **Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only. Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.**
- `2024/08/27` We released our v2 paper including appendix.
- `2024/07/31` We submitted our paper on arXiv and released our project page.
## 📑 Table of Contents
- [Showcases](#%EF%B8%8F-showcases)
- [Model Weights](#-model-weights)
- [Inference](#-inference)
- [Acknowledgements](#-acknowledgements)
- [Our previous work](#-our-previous-work)
- [Citation](#-citation)
## 🎞️ Showcases
All showcase videos are available for download at this [link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip).
## 📦 Model Weights
### Download Links
Downloading these weights requires compliance with the [CogVideoX License](CogVideoX_LICENSE).
- SDK
```python
from modelscope import snapshot_download

# Downloads the Tora weights from ModelScope and returns the local directory path.
model_dir = snapshot_download('xiaoche/Tora')
```
- Git
```bash
# Large weight files are assumed to be stored with Git LFS; install it before cloning.
git lfs install
git clone https://www.modelscope.cn/xiaoche/Tora.git
```
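If you prefer to keep the SDK download in a specific local directory, `snapshot_download` also accepts a `cache_dir` argument; the path below is only an example.

```python
from modelscope import snapshot_download

# cache_dir controls where ModelScope stores the downloaded files (illustrative path).
model_dir = snapshot_download('xiaoche/Tora', cache_dir='./checkpoints')
print(model_dir)  # local path containing the Tora weights
```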
## 🔄 Inference
Please refer to our [GitHub repository](https://github.com/alibaba/Tora) or the [ModelScope online demo](https://www.modelscope.cn/studios/xiaoche/Tora) for inference instructions.
### Recommendations for Text Prompts
For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.
You can refer to the following resources for guidance:
- [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
- [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
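As a rough illustration of this kind of prompt enhancement (a minimal sketch assuming access to the OpenAI Python SDK and an `OPENAI_API_KEY`; the system prompt is our own paraphrase, not the exact one used in the resources above):

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and configured

client = OpenAI()

def enhance_prompt(short_prompt: str) -> str:
    """Expand a short video idea into a detailed description (illustrative system prompt)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You rewrite short video ideas into rich, concrete descriptions: "
                    "describe the subject, setting, lighting, camera, and motion in one paragraph."
                ),
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(enhance_prompt("A sailboat drifting across a calm lake at sunset"))
```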
## 🤝 Acknowledgements
We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
- [CogVideo](https://github.com/THUDM/CogVideo): An open source video generation framework by THUKEG.
- [Open-Sora](https://github.com/hpcaitech/Open-Sora): An open source video generation framework by HPC-AI Tech.
- [MotionCtrl](https://github.com/TencentARC/MotionCtrl): A video generation model supporting motion control by ARC Lab, Tencent PCG.
- [ComfyUI-DragNUWA](https://github.com/chaojie/ComfyUI-DragNUWA): An implementation of DragNUWA for ComfyUI.
Special thanks to the contributors of these libraries for their hard work and dedication!
## 📄 Our previous work
- [AnimateAnything: Fine Grained Open Domain Image Animation with Motion Guidance](https://github.com/alibaba/animate-anything)
## 📚 Citation
```bibtex
@misc{zhang2024toratrajectoryorienteddiffusiontransformer,
title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
year={2024},
eprint={2407.21705},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.21705},
}
```