---
license: mit
pipeline_tag: text-to-video
library_name: pytorch
---

# LanDiff

<p align="center">
    🎬 <a href="https://landiff.github.io/"><b>Demo Page</b></a> &nbsp;&nbsp;|
    &nbsp;&nbsp;🤗 <a href="https://huggingface.co/yinaoxiong/LanDiff">Hugging Face</a>&nbsp;&nbsp;|
    &nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/models/yinaoxiong/LanDiff">ModelScope</a>&nbsp;&nbsp;|
    &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2503.04606">Paper</a>
</p>
<br>

-----

[**The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation**](https://arxiv.org/abs/2503.04606)

In this repository, we present **LanDiff**, a novel text-to-video generation framework that synergizes the strengths of Language Models and Diffusion Models. **LanDiff** offers these key features:

- 👍 **High Performance**: **LanDiff** (5B) achieves a score of **85.43** on the VBench T2V benchmark, surpassing state-of-the-art open-source models such as Hunyuan Video (13B) and remaining competitive with leading commercial models such as Sora, Kling, and Hailuo. It also achieves SOTA performance among open-source models for long video generation.
- 👍 **Novel Hybrid Architecture**: **LanDiff** pioneers a **coarse-to-fine** generation pipeline that integrates Language Models (for high-level semantics) and Diffusion Models (for high-fidelity visual details), combining the advantages of both paradigms (see the sketch after this list).
- 👍 **Extreme-Compression Semantic Tokenizer**: An innovative video semantic tokenizer compresses rich 3D visual features into compact 1D discrete representations using query tokens and frame grouping, achieving an exceptional **~14,000x compression ratio** while preserving crucial semantic information.
- 👍 **Efficient Long Video Generation**: A **streaming diffusion model** generates long videos chunk by chunk, significantly reducing computational requirements and enabling scalable video synthesis.
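To make the two-stage design concrete, here is a minimal, illustrative sketch of the coarse-to-fine flow. Every name in it (`language_model.generate`, `diffusion_decoder.denoise`, `chunk_size`) is a hypothetical placeholder for exposition, not the actual LanDiff API:

```python
import torch

# Illustrative sketch of LanDiff's coarse-to-fine pipeline. All module and
# method names here are hypothetical placeholders, NOT the repository's API.
def generate_video(prompt: str, language_model, diffusion_decoder,
                   chunk_size: int = 16) -> torch.Tensor:
    # Stage 1 (coarse): a language model autoregressively predicts compact
    # 1D semantic tokens from the text prompt. These tokens carry high-level
    # semantics at a fraction of the size of the raw 3D visual features.
    semantic_tokens = language_model.generate(prompt)  # shape: (num_tokens,)

    # Stage 2 (fine): a streaming diffusion model renders high-fidelity RGB
    # frames chunk by chunk, so peak memory stays bounded for long videos.
    chunks = []
    for start in range(0, semantic_tokens.shape[0], chunk_size):
        token_chunk = semantic_tokens[start:start + chunk_size]
        frames = diffusion_decoder.denoise(token_chunk)  # (T, C, H, W) per chunk
        chunks.append(frames)
    return torch.cat(chunks, dim=0)  # full video as one frame tensor
```

The chunked loop is what enables efficient long-video generation: each diffusion pass only ever conditions on one chunk of semantic tokens, so cost grows with video length rather than with a single full-length denoising window.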


## Quickstart

### Prerequisites
- Python 3.10 or higher (validated with 3.10)
- PyTorch 2.5 or higher (validated with 2.5)

### Installation
#### Clone the repository
```bash
git clone https://github.com/LanDiff/LanDiff
cd LanDiff
```
#### Using UV
```bash
# Create environment
uv sync
# Install Gradio to run the local demo (optional)
uv sync --extra gradio
```
#### Using Conda
```bash
# Create and activate Conda environment
conda create -n landiff python=3.10
conda activate landiff
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt
# Install Gradio to run the local demo (optional)
pip install gradio==5.27.0
```
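After either installation path, a quick sanity check (plain Python plus PyTorch, no LanDiff-specific calls) can confirm the environment meets the prerequisites above:

```python
import sys
import torch

# Verify the prerequisites: Python >= 3.10 and PyTorch >= 2.5.
assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version}"
torch_version = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert torch_version >= (2, 5), f"PyTorch 2.5+ required, found {torch.__version__}"

print(f"Python {sys.version.split()[0]}, PyTorch {torch.__version__}, "
      f"CUDA available: {torch.cuda.is_available()}")
```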

## Model Download
| Model   | Hugging Face                                                  | ModelScope                                                           |
|---------|---------------------------------------------------------------|----------------------------------------------------------------------|
| LanDiff | 🤗 [Hugging Face](https://huggingface.co/yinaoxiong/LanDiff)  | 🤖 [ModelScope](https://www.modelscope.cn/models/yinaoxiong/LanDiff) |
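
If you prefer scripted downloads over the web links above, the standard `huggingface_hub` client works for this repository; the local directory below is just an example path:

```python
from huggingface_hub import snapshot_download

# Download the full LanDiff weights from the Hugging Face Hub.
# local_dir is an example path; change it to wherever you keep checkpoints.
local_path = snapshot_download(
    repo_id="yinaoxiong/LanDiff",
    local_dir="./ckpts/LanDiff",
)
print(f"Model files downloaded to: {local_path}")
```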


## License

Code derived from CogVideo is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Citation
If you find our work helpful, please consider citing it:

```bibtex
@article{landiff,
  title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation},
  author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang},
  journal={arXiv preprint arXiv:2503.04606},
  year={2025}
}
```

## Acknowledgements

We would like to thank the contributors to the [CogVideo](https://github.com/THUDM/CogVideo), [Theia](https://github.com/bdaiinstitute/theia), [TiTok](https://github.com/bytedance/1d-tokenizer), [flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl), and [HuggingFace](https://huggingface.co) repositories for their open research.