|
--- |
|
license: mit |
|
pipeline_tag: text-to-video |
|
library_name: pytorch |
|
--- |
|
|
|
# LanDiff |
|
|
|
<p align="center">
🎬 <a href="https://landiff.github.io/"><b>Demo Page</b></a> |
🤗 <a href="https://huggingface.co/yinaoxiong/LanDiff">Hugging Face</a> |
🤖 <a href="https://www.modelscope.cn/models/yinaoxiong/LanDiff">ModelScope</a> |
📑 <a href="https://arxiv.org/abs/2503.04606">Paper</a>
</p>
|
<br> |
|
|
|
----- |
|
|
|
[**The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation**](https://arxiv.org/abs/2503.04606) |
|
|
|
In this repository, we present **LanDiff**, a novel text-to-video generation framework that synergizes the strengths of Language Models and Diffusion Models. **LanDiff** offers these key features: |
|
|
|
- **High Performance**: **LanDiff** (5B) achieves a score of **85.43** on the VBench T2V benchmark, surpassing state-of-the-art open-source models such as Hunyuan Video (13B) and remaining competitive with leading commercial models such as Sora, Kling, and Hailuo. It also achieves SOTA performance among open-source models for long video generation.

- **Novel Hybrid Architecture**: **LanDiff** pioneers a **coarse-to-fine** generation pipeline that integrates Language Models (for high-level semantics) and Diffusion Models (for high-fidelity visual details), combining the advantages of both paradigms (see the sketch after this list).

- **Extreme Compression Semantic Tokenizer**: Features an innovative video semantic tokenizer that compresses rich 3D visual features into compact 1D discrete representations using query tokens and frame grouping, achieving an exceptional **~14,000x compression ratio** while preserving crucial semantic information.

- **Efficient Long Video Generation**: Implements a **streaming diffusion model** that generates long videos chunk by chunk, significantly reducing computational requirements and enabling scalable video synthesis.
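
The coarse-to-fine flow described above can be illustrated with a short, runnable toy. This is a minimal conceptual sketch, not the repository's API: `SemanticLM`, `StreamingDiffusion`, and `generate_video` are hypothetical stand-ins that only mimic the two-stage, chunk-by-chunk data flow.

```python
# Toy sketch of LanDiff's coarse-to-fine pipeline. All names here are
# hypothetical stand-ins, NOT the actual LanDiff API.

class SemanticLM:
    """Stage 1 stand-in: text prompt -> compact discrete semantic tokens."""

    def generate(self, prompt: str) -> list[int]:
        # A real LM predicts tokens autoregressively; we fake one per word.
        return [hash(word) % 8192 for word in prompt.split()]

class StreamingDiffusion:
    """Stage 2 stand-in: semantic tokens -> high-fidelity video chunk."""

    def decode(self, tokens: list[int]) -> str:
        return f"<chunk decoded from {len(tokens)} semantic tokens>"

def generate_video(prompt: str, chunk_size: int = 4) -> list[str]:
    lm, diffusion = SemanticLM(), StreamingDiffusion()
    tokens = lm.generate(prompt)  # coarse pass: high-level semantics first
    # Fine pass: decode chunk by chunk so memory stays bounded for long videos.
    chunks = [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return [diffusion.decode(c) for c in chunks]

print(generate_video("a red panda rolling down a grassy hill"))
```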
|
|
|
|
|
## Quickstart |
|
|
|
### Prerequisites |
|
- Python 3.10 or higher (validated with 3.10)

- PyTorch 2.5 or higher (validated with 2.5)
|
|
|
### Installation |
|
#### Clone the repository |
|
```bash |
|
git clone https://github.com/LanDiff/LanDiff |
|
cd LanDiff |
|
``` |
|
#### Using UV |
|
```bash |
|
# Create environment |
|
uv sync |
|
# Install Gradio to run the local demo (optional)
|
uv sync --extra gradio |
|
``` |
|
#### Using Conda |
|
```bash |
|
# Create and activate Conda environment |
|
conda create -n landiff python=3.10 |
|
conda activate landiff |
|
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121 |
|
|
|
# Install dependencies |
|
pip install -r requirements.txt |
|
# Install Gradio to run the local demo (optional)
|
pip install gradio==5.27.0 |
|
``` |
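
Whichever install path you use, a quick sanity check confirms the interpreter sees a suitable PyTorch build and a CUDA device:

```python
# Verify the environment (run inside the activated environment).
import torch

print(f"PyTorch {torch.__version__}")                 # expect 2.5 or newer
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```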
|
|
|
## Model Download |
|
| Model   | Hugging Face                                                 | ModelScope                                                           |
|---------|--------------------------------------------------------------|----------------------------------------------------------------------|
| LanDiff | 🤗 [Hugging Face](https://huggingface.co/yinaoxiong/LanDiff) | 🤖 [ModelScope](https://www.modelscope.cn/models/yinaoxiong/LanDiff) |
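
For scripted downloads, the checkpoint can also be fetched with the `huggingface_hub` client. A minimal sketch; `local_dir` below is an arbitrary example path, not a location the code requires:

```python
# Download the LanDiff weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="yinaoxiong/LanDiff",
    local_dir="ckpts/LanDiff",  # example path; adjust to your setup
)
```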
|
|
|
|
|
## License |
|
|
|
Code derived from CogVideo is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT). |
|
|
|
## Citation |
|
If you find our work helpful, please cite us. |
|
|
|
```bibtex
|
@article{landiff, |
|
title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation}, |
|
author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang}, |
|
journal={arXiv preprint arXiv:2503.04606}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
## Acknowledgements |
|
|
|
We would like to thank the contributors to the [CogVideo](https://github.com/THUDM/CogVideo), [Theia](https://github.com/bdaiinstitute/theia), [TiTok](https://github.com/bytedance/1d-tokenizer), [flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl), and [HuggingFace](https://huggingface.co) repositories for their open research.