Add/improve model card: metadata, links, and cleanup (#1)
Co-authored-by: Niels Rogge <[email protected]>

---
license: mit
pipeline_tag: text-to-video
library_name: pytorch
---

# LanDiff

<p align="center">
🎬 <a href="https://landiff.github.io/"><b>Demo Page</b></a>&nbsp;&nbsp;｜&nbsp;&nbsp;🤗 <a href="https://huggingface.co/yinaoxiong/LanDiff">Hugging Face</a>&nbsp;&nbsp;｜&nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/models/yinaoxiong/LanDiff">ModelScope</a>&nbsp;&nbsp;｜&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2503.04606">Paper</a>
</p>
<br>

-----

[**The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation**](https://arxiv.org/abs/2503.04606)

In this repository, we present **LanDiff**, a novel text-to-video generation framework that combines the strengths of language models and diffusion models. **LanDiff** offers these key features:

- **High Performance**: **LanDiff** (5B) achieves a score of **85.43** on the VBench T2V benchmark, surpassing state-of-the-art open-source models such as Hunyuan Video (13B) and competing with leading commercial models such as Sora, Kling, and Hailuo. It also achieves SOTA performance among open-source models for long video generation.
- **Novel Hybrid Architecture**: **LanDiff** pioneers a **coarse-to-fine** generation pipeline that integrates a language model (for high-level semantics) with a diffusion model (for high-fidelity visual details), combining the advantages of both paradigms; see the sketch after this list.
- **Extreme-Compression Semantic Tokenizer**: An innovative video semantic tokenizer compresses rich 3D visual features into compact 1D discrete representations using query tokens and frame grouping, achieving an exceptional **~14,000x compression ratio** while preserving crucial semantic information.
- **Efficient Long Video Generation**: A **streaming diffusion model** generates long videos chunk by chunk, significantly reducing computational requirements and enabling scalable video synthesis.
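
To make the coarse-to-fine design concrete, the sketch below traces the two stages. It is illustrative only: `generate_video`, `semantic_lm`, and `streaming_diffusion` are hypothetical placeholders, not the repository's actual API.

```python
# Illustrative sketch of LanDiff's two-stage, coarse-to-fine pipeline.
# All names here are hypothetical placeholders, not the repository's real API.

def generate_video(prompt: str, semantic_lm, streaming_diffusion, chunk_size: int = 16) -> list:
    # Stage 1 (coarse): the language model autoregressively predicts compact
    # 1D semantic tokens from the text prompt. These tokens live in the space
    # produced by the semantic tokenizer, which compresses 3D visual features
    # at roughly 14,000:1.
    semantic_tokens = semantic_lm.generate(prompt)

    # Stage 2 (fine): the streaming diffusion model decodes the tokens into
    # high-fidelity frames chunk by chunk, so long videos never require the
    # whole sequence to be resident in memory at once.
    return [
        streaming_diffusion.decode(semantic_tokens[i : i + chunk_size])
        for i in range(0, len(semantic_tokens), chunk_size)
    ]
```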
## Quickstart

### Prerequisites

- Python 3.10 (validated) or higher
- PyTorch 2.5 (validated) or higher
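
Before installing anything, a one-liner can confirm the interpreter is new enough; this is a generic Python check, not a project script.

```python
import sys

# LanDiff is validated on Python 3.10; fail fast on older interpreters.
assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version.split()[0]}"
```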
### Installation

#### Clone the repository

```bash
git clone https://github.com/LanDiff/LanDiff
cd LanDiff
```

#### Using UV

```bash
# Create environment
uv sync
# Install gradio to run the local demo (optional)
uv sync --extra gradio
```

#### Using Conda

```bash
# Create and activate Conda environment
conda create -n landiff python=3.10
conda activate landiff
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt
# Install gradio to run the local demo (optional)
pip install gradio==5.27.0
```
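
Whichever path you used, a quick sanity check (generic PyTorch, not a project script) confirms the expected build is visible:

```python
import torch

# Expect 2.5.x here, and True on a CUDA-capable machine if the cu121
# wheels were installed correctly.
print(torch.__version__)
print(torch.cuda.is_available())
```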
## Model Download

| Model   | Hugging Face                                                 | ModelScope                                                           |
|---------|--------------------------------------------------------------|----------------------------------------------------------------------|
| LanDiff | 🤗 [Hugging Face](https://huggingface.co/yinaoxiong/LanDiff) | 🤖 [ModelScope](https://www.modelscope.cn/models/yinaoxiong/LanDiff) |
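
The weights can also be fetched programmatically with `huggingface_hub`. The `local_dir` below is an assumption; place the checkpoints wherever the inference scripts expect them.

```python
from huggingface_hub import snapshot_download

# Download all LanDiff checkpoints from the Hugging Face Hub.
# NOTE: local_dir is an assumption, not a path mandated by the repository.
snapshot_download(repo_id="yinaoxiong/LanDiff", local_dir="ckpts/LanDiff")
```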
## License

Code derived from CogVideo is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Citation

If you find our work helpful, please cite us.

```bibtex
@article{landiff,
  title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation},
  author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang},
  journal={arXiv preprint arXiv:2503.04606},
  year={2025}
}
```

## Acknowledgements

We would like to thank the contributors to the [CogVideo](https://github.com/THUDM/CogVideo), [Theia](https://github.com/bdaiinstitute/theia), [TiTok](https://github.com/bytedance/1d-tokenizer), [flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl), and [HuggingFace](https://huggingface.co) repositories for their open research.