|
--- |
|
license: mit |
|
pipeline_tag: text-to-video |
|
library_name: pytorch |
|
--- |
|
|
|
# LanDiff |
|
|
|
<p align="center">
🎬 <a href="https://landiff.github.io/"><b>Demo Page</b></a> |
🤗 <a href="https://huggingface.co/yinaoxiong/LanDiff">Hugging Face</a> |
🤖 <a href="https://www.modelscope.cn/models/yinaoxiong/LanDiff">ModelScope</a> |
📑 <a href="https://arxiv.org/abs/2503.04606">Paper</a>
</p>
|
<br> |
|
|
|
----- |
|
|
|
[**The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation**](https://arxiv.org/abs/2503.04606) |
|
|
|
In this repository, we present **LanDiff**, a novel text-to-video generation framework that synergizes the strengths of Language Models and Diffusion Models. **LanDiff** offers these key features: |
|
|
|
- **High Performance**: **LanDiff** (5B) achieves a score of **85.43** on the VBench T2V benchmark, surpassing state-of-the-art open-source models such as Hunyuan Video (13B) and remaining competitive with leading commercial models such as Sora, Kling, and Hailuo. It also achieves SOTA performance among open-source models for long video generation.

- **Novel Hybrid Architecture**: **LanDiff** pioneers a **coarse-to-fine** generation pipeline that integrates Language Models (for high-level semantics) and Diffusion Models (for high-fidelity visual details), combining the advantages of both paradigms (see the sketch after this list).

- **Extreme Compression Semantic Tokenizer**: Features an innovative video semantic tokenizer that compresses rich 3D visual features into compact 1D discrete representations using query tokens and frame grouping, achieving an exceptional **~14,000x compression ratio** while preserving crucial semantic information.

- **Efficient Long Video Generation**: Implements a **streaming diffusion model** that generates long videos chunk by chunk, significantly reducing computational requirements and enabling scalable video synthesis.
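
The coarse-to-fine flow described above can be illustrated with a short, runnable toy. This is a minimal conceptual sketch, not the repository's API: `SemanticLM`, `StreamingDiffusion`, and `generate_video` are hypothetical stand-ins that only mimic the two-stage, chunk-by-chunk data flow.

```python
# Toy sketch of LanDiff's coarse-to-fine pipeline. All names here are
# hypothetical stand-ins, NOT the actual LanDiff API.

class SemanticLM:
    """Stage 1 stand-in: text prompt -> compact discrete semantic tokens."""

    def generate(self, prompt: str) -> list[int]:
        # A real LM predicts tokens autoregressively; we fake one per word.
        return [hash(word) % 8192 for word in prompt.split()]

class StreamingDiffusion:
    """Stage 2 stand-in: semantic tokens -> high-fidelity video chunk."""

    def decode(self, tokens: list[int]) -> str:
        return f"<chunk decoded from {len(tokens)} semantic tokens>"

def generate_video(prompt: str, chunk_size: int = 4) -> list[str]:
    lm, diffusion = SemanticLM(), StreamingDiffusion()
    tokens = lm.generate(prompt)  # coarse pass: high-level semantics first
    # Fine pass: decode chunk by chunk so memory stays bounded for long videos.
    chunks = [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return [diffusion.decode(c) for c in chunks]

print(generate_video("a red panda rolling down a grassy hill"))
```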
|
|
|
|
|
## Quickstart |
|
|
|
### Prerequisites |
|
- Python 3.10 or higher (validated with 3.10)

- PyTorch 2.5 or higher (validated with 2.5)
|
|
|
### Installation |
|
#### Clone the repository |
|
```bash |
|
git clone https://github.com/LanDiff/LanDiff |
|
cd LanDiff |
|
``` |
|
#### Using UV |
|
```bash |
|
# Create environment |
|
uv sync |
|
# Install Gradio to run the local demo (optional)
|
uv sync --extra gradio |
|
``` |
|
#### Using Conda |
|
```bash |
|
# Create and activate Conda environment |
|
conda create -n landiff python=3.10 |
|
conda activate landiff |
|
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121 |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
# Install Gradio to run the local demo (optional)
|
pip install gradio==5.27.0 |
|
``` |
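
Whichever install path you use, a quick sanity check confirms the interpreter sees a suitable PyTorch build and a CUDA device:

```python
# Verify the environment (run inside the activated environment).
import torch

print(f"PyTorch {torch.__version__}")                 # expect 2.5 or newer
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```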
|
|
|
## Model Download |
|
| Model   | Hugging Face                                                 | ModelScope                                                           |
|---------|--------------------------------------------------------------|----------------------------------------------------------------------|
| LanDiff | 🤗 [Hugging Face](https://huggingface.co/yinaoxiong/LanDiff) | 🤖 [ModelScope](https://www.modelscope.cn/models/yinaoxiong/LanDiff) |
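
For scripted downloads, the checkpoint can also be fetched with the `huggingface_hub` client. A minimal sketch; `local_dir` below is an arbitrary example path, not a location the code requires:

```python
# Download the LanDiff weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="yinaoxiong/LanDiff",
    local_dir="ckpts/LanDiff",  # example path; adjust to your setup
)
```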
|
|
|
|
|
## License |
|
|
|
Code derived from CogVideo is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT). |
|
|
|
## Citation |
|
If you find our work helpful, please cite us. |
|
|
|
```bibtex
|
@article{landiff, |
|
title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation}, |
|
author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang}, |
|
journal={arXiv preprint arXiv:2503.04606}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
## Acknowledgements |
|
|
|
We would like to thank the contributors to the [CogVideo](https://github.com/THUDM/CogVideo), [Theia](https://github.com/bdaiinstitute/theia), [TiTok](https://github.com/bytedance/1d-tokenizer), [flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl), and [HuggingFace](https://huggingface.co) repositories for their open research.