yinaoxiong and nielsr (HF Staff) committed
Commit 30628aa · verified · 1 Parent(s): c82ffdc

Add/improve model card: metadata, links, and cleanup (#1)


- Add/improve model card: metadata, links, and cleanup (ece39d8bcdb4bf66036857bb43ae3677eae34d81)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +85 -3
README.md CHANGED
---
license: mit
pipeline_tag: text-to-video
library_name: pytorch
---

# LanDiff

<p align="center">
  🎬 <a href="https://landiff.github.io/"><b>Demo Page</b></a> &nbsp;&nbsp;|
  &nbsp;&nbsp;🤗 <a href="https://huggingface.co/yinaoxiong/LanDiff">Hugging Face</a>&nbsp;&nbsp;|
  &nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/models/yinaoxiong/LanDiff">ModelScope</a>&nbsp;&nbsp;|
  &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2503.04606">Paper</a>
</p>
<br>

-----

[**The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation**](https://arxiv.org/abs/2503.04606)

In this repository we present **LanDiff**, a novel text-to-video generation framework that combines the strengths of language models and diffusion models. **LanDiff** offers these key features:

- 👍 **High Performance**: **LanDiff** (5B) achieves a score of **85.43** on the VBench T2V benchmark, surpassing state-of-the-art open-source models such as Hunyuan Video (13B) and remaining competitive with leading commercial models such as Sora, Kling, and Hailuo. It also achieves state-of-the-art performance among open-source models for long video generation.
- 👍 **Novel Hybrid Architecture**: **LanDiff** pioneers a **coarse-to-fine** generation pipeline that integrates a language model (for high-level semantics) with a diffusion model (for high-fidelity visual detail), combining the advantages of both paradigms.
- 👍 **Extreme-Compression Semantic Tokenizer**: an innovative video semantic tokenizer compresses rich 3D visual features into compact 1D discrete representations using query tokens and frame grouping, achieving an exceptional **~14,000x compression ratio** while preserving crucial semantic information (a back-of-envelope illustration follows this list).
- 👍 **Efficient Long Video Generation**: a **streaming diffusion model** generates long videos chunk by chunk, significantly reducing computational requirements and enabling scalable video synthesis.

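To get a feel for where a number like **~14,000x** can come from, here is a back-of-envelope sketch comparing raw pixel bits against a short sequence of discrete token indices. All concrete values below (frame count, resolution, token count, codebook size) are illustrative assumptions for this sketch, not figures from the LanDiff paper.

```python
# Back-of-envelope compression arithmetic. Every number below is an
# illustrative assumption, NOT a value taken from the LanDiff paper.
import math

# Hypothetical raw clip: 49 frames of 480x720 RGB, 8 bits per channel.
frames, height, width, channels, bits_per_channel = 49, 480, 720, 3, 8
raw_bits = frames * height * width * channels * bits_per_channel

# Hypothetical 1D semantic sequence: 2,000 discrete tokens, each an
# index into a 16,384-entry codebook (14 bits per token).
num_tokens = 2_000
codebook_size = 2 ** 14
token_bits = num_tokens * math.log2(codebook_size)

print(f"compression ratio ~ {raw_bits / token_bits:,.0f}x")
# With these made-up settings the ratio works out to roughly 14,500x,
# i.e. the same order of magnitude as the figure quoted above.
```

The point of the sketch is only that replacing dense 3D features with a short 1D index sequence plausibly yields ratios of this magnitude; the paper's actual tokenizer configuration determines the real number.
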
## Quickstart

### Prerequisites
- Python 3.10 or higher (3.10 validated)
- PyTorch 2.5 or higher (2.5 validated)

### Installation
#### Clone the repository
```bash
git clone https://github.com/LanDiff/LanDiff
cd LanDiff
```
#### Using UV
```bash
# Create the environment
uv sync
# Optional: install Gradio to run the local demo
uv sync --extra gradio
```
#### Using Conda
```bash
# Create and activate the Conda environment
conda create -n landiff python=3.10
conda activate landiff
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt
# Optional: install Gradio to run the local demo
pip install gradio==5.27.0
```

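Whichever route you take, a quick sanity check that the validated versions are in place can save debugging time later. This snippet is not part of the LanDiff repository; it only uses standard Python and PyTorch introspection.

```python
# Environment sanity check (not part of the LanDiff repo).
import sys
import torch

print(f"Python  : {sys.version.split()[0]}")  # expect 3.10+
print(f"PyTorch : {torch.__version__}")       # expect 2.5+
print(f"CUDA ok : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU     : {torch.cuda.get_device_name(0)}")
```
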
## Model Download
| Model | Hugging Face | ModelScope |
|---------|--------------|------------|
| LanDiff | 🤗 [Hugging Face](https://huggingface.co/yinaoxiong/LanDiff) | 🤖 [ModelScope](https://www.modelscope.cn/models/yinaoxiong/LanDiff) |

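If you prefer to fetch the weights programmatically rather than through the web UI, the standard `huggingface_hub` client works with this repository. The `local_dir` below is an arbitrary example path, not a location the code requires.

```python
# Download the LanDiff checkpoint from the Hugging Face Hub.
# `local_dir` is an arbitrary example path, not one mandated by the repo.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="yinaoxiong/LanDiff",
    local_dir="./ckpts/LanDiff",
)
print(f"Checkpoint files are in {ckpt_dir}")
```
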
## License

Code derived from CogVideo is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Citation
If you find our work helpful, please cite us.

```bibtex
@article{landiff,
  title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation},
  author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang},
  journal={arXiv preprint arXiv:2503.04606},
  year={2025}
}
```

## Acknowledgements

We would like to thank the contributors to the [CogVideo](https://github.com/THUDM/CogVideo), [Theia](https://github.com/bdaiinstitute/theia), [TiTok](https://github.com/bytedance/1d-tokenizer), [flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl), and [HuggingFace](https://huggingface.co) repositories for their open research.