Add files using upload-large-folder tool
- README.md +136 -3
- high_noise_pusa.safetensors +3 -0
- low_noise_pusa.safetensors +3 -0
README.md CHANGED
@@ -1,3 +1,136 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
datasets:
- RaphaelLiu/PusaV1_training
base_model:
- Wan-AI/Wan2.2-T2V-A14B
tags:
- image-to-video
- start-end-frames
- text-to-video
- video-to-video
- video-extension
---

# Pusa Wan2.2 V1.0 Model

[Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Project Page](https://yaofang-liu.github.io/Pusa_Web/) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV1_training) | [Wan2.1 Model](https://huggingface.co/RaphaelLiu/PusaV1) | [Paper (Pusa V1.0)](https://arxiv.org/abs/2507.16116) | [Paper (FVDM)](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)

## Overview

**Pusa Wan2.2 V1.0** extends the Pusa paradigm to the advanced **Wan2.2-T2V-A14B** architecture, featuring a **MoE DiT design** with separate high-noise and low-noise models. This architecture provides enhanced quality control and generation capability while retaining the **vectorized timestep adaptation (VTA)** approach.

Building on the success of Pusa V1.0, this Wan2.2 variant leverages the MoE DiT architecture to achieve even better temporal modeling and video quality. The model supports **⚡ LightX2V acceleration** for ultra-fast 4-step inference while maintaining generation quality.

Progress in video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. **Pusa**¹ addresses these challenges through **vectorized timestep adaptation (VTA)**, enabling fine-grained temporal control within a unified video diffusion framework.

By adapting the SOTA Wan2.2-T2V-A14B model with VTA, we achieve high efficiency and enhanced quality through the dual DiT architecture. Pusa is not only capable of image-to-video (I2V) generation but also **unlocks many zero-shot multi-task capabilities such as start-end frames and video extension**, all without task-specific training.

¹ *Pusa (菩萨, /puːˈsɑː/) commonly refers to the "Thousand-Hand Guanyin" in Chinese, whose many hands symbolize boundless compassion and ability. We use this name to indicate that our model uses many timestep variables to achieve numerous video generation capabilities, and we will fully open-source it so the community can benefit from this technology.*
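
The intuition behind VTA can be shown with a minimal, purely illustrative sketch (this is not the Pusa implementation; the frame count, timestep range, and multiplier values are placeholder assumptions): a scalar timestep gives every frame the same noise level, while a vectorized timestep assigns one level per frame, so a conditioning frame can stay nearly clean while the rest are denoised from full noise.

```python
# Illustrative only: contrast a scalar timestep with a per-frame (vectorized) timestep.
import torch

num_frames, t_max = 21, 1000  # placeholder values

# Conventional scalar timestep: one noise level shared by all frames.
scalar_t = torch.full((num_frames,), t_max)

# Vectorized timestep: frame 0 acts as a conditioning frame with a small noise
# multiplier (cf. --cond_position "0" --noise_multipliers "0.2" in the examples
# below), while the remaining frames keep the full noise level.
noise_multipliers = torch.tensor([0.2] + [1.0] * (num_frames - 1))
vector_t = (noise_multipliers * t_max).long()

print(scalar_t[:4])  # tensor([1000, 1000, 1000, 1000])
print(vector_t[:4])  # tensor([ 200, 1000, 1000, 1000])
```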

## 🆕 Wan2.2 Enhancements

- **MoE DiT Architecture**: Separate high-noise and low-noise DiT models for enhanced quality control (see the sketch below)
- **⚡ LightX2V Support**: Ultra-fast 4-step inference with maintained quality
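
How the two experts divide the work can be pictured with a small, purely illustrative sketch (not the actual Wan2.2 or Pusa routing code; the boundary value is an assumption): early, high-noise denoising steps are handled by one DiT and later, low-noise steps by the other.

```python
# Illustrative only: route each denoising step to one of the two expert DiTs
# by its noise level. BOUNDARY is an assumed switch point, not a verified constant.
T_MAX = 1000
BOUNDARY = 0.875

def select_expert(timestep: int) -> str:
    """Return which expert handles this step of the denoising schedule."""
    return "high_noise_model" if timestep / T_MAX >= BOUNDARY else "low_noise_model"

print(select_expert(990))  # high_noise_model (early, very noisy step)
print(select_expert(300))  # low_noise_model  (late, mostly denoised step)
```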

## ✨ Key Features

- **Comprehensive Multi-task Support**:
  - Image-to-Video
  - Start-End Frames
  - Video Completion
  - Video Extension
  - Text-to-Video
  - Video Transition
  - And more...

## Installation and Usage

### Download Weights and Setup

**Option 1**: Use the Hugging Face CLI:
```shell
# Make sure you are in the PusaV1 directory
# Install huggingface-cli if you don't have it
pip install -U "huggingface_hub[cli]"
huggingface-cli download RaphaelLiu/Pusa-Wan2.2-V1 --local-dir ./model_zoo/PusaV1/Wan2.2-Models

# Download base Wan2.2 models if you don't have them
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./model_zoo/PusaV1/Wan2.2-T2V-A14B
```

**Option 2**: Download the LoRA checkpoints directly from [this Hugging Face repository](https://huggingface.co/RaphaelLiu/Pusa-Wan2.2-V1).
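
If you prefer Python to the CLI, here is a minimal sketch using the `huggingface_hub` API (the file names come from this repository's file listing, and the target directory mirrors the CLI example above):

```python
# Fetch the two Pusa Wan2.2 LoRA checkpoints with the huggingface_hub Python API.
from huggingface_hub import hf_hub_download

for name in ["high_noise_pusa.safetensors", "low_noise_pusa.safetensors"]:
    path = hf_hub_download(
        repo_id="RaphaelLiu/Pusa-Wan2.2-V1",
        filename=name,
        local_dir="./model_zoo/PusaV1/Wan2.2-Models",
    )
    print(f"Downloaded {name} to {path}")
```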

### Usage Examples

Use these checkpoints with the [Pusa codebase](https://github.com/Yaofang-Liu/Pusa-VidGen).

**Standard Inference (10 or more steps, cfg_scale=3.0):**
```shell
python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
  --image_paths "./demos/input_image.jpg" \
  --prompt "Your prompt here" \
  --cond_position "0" \
  --noise_multipliers "0.2" \
  --high_lora_path "./path/to/high_noise_lora.pt" \
  --high_lora_alpha 1.5 \
  --low_lora_path "./path/to/low_noise_lora.pt" \
  --low_lora_alpha 1.4 \
  --num_inference_steps 30 \
  --cfg_scale 3.0
```

**LightX2V Acceleration (4 steps, cfg_scale=1.0):**
```shell
python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
  --image_paths "./demos/input_image.jpg" \
  --prompt "Your prompt here" \
  --cond_position "0" \
  --noise_multipliers "0" \
  --high_lora_path "./path/to/high_noise_lora.pt" \
  --high_lora_alpha 1.5 \
  --low_lora_path "./path/to/low_noise_lora.pt" \
  --low_lora_alpha 1.4 \
  --num_inference_steps 4 \
  --cfg_scale 1 \
  --lightx2v
```

### Key Parameters for Wan2.2

- **`--cond_position`**: Index (or comma-separated indices) of the conditioning frame(s); `"0"` in the examples above conditions on the first frame
- **`--noise_multipliers`**: Noise level applied to each conditioning frame, one value per entry in `--cond_position`; the examples use `0.2` for standard inference and `0` with LightX2V
- **`--high_lora_path`**: Path to the high-noise DiT LoRA checkpoint
- **`--low_lora_path`**: Path to the low-noise DiT LoRA checkpoint
- **`--high_lora_alpha`**: LoRA alpha for the high-noise model (recommended: 1.5)
- **`--low_lora_alpha`**: LoRA alpha for the low-noise model (recommended: 1.4)
- **`--lightx2v`**: Enable LightX2V acceleration
- **`--cfg_scale`**: Use 1.0 with LightX2V, 3.0 for standard inference

## Related Work

- [FVDM](https://arxiv.org/abs/2410.03160): Introduces frame-level noise control with a vectorized timestep approach, the idea that inspired Pusa
- [Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B): The dual-DiT base model for this version
- [LightX2V](https://github.com/ModelTC/LightX2V): Acceleration technique for fast inference
- [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): Optimized LoRA implementation used for efficient training

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@article{liu2025pusa,
  title={PUSA V1.0: Surpassing Wan-I2V with \$500 Training Cost by Vectorized Timestep Adaptation},
  author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
  journal={arXiv preprint arXiv:2507.16116},
  year={2025}
}
```

```bibtex
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```
high_noise_pusa.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f0219116fb8880de27f6b6a21d8ad9a16916bee7e7570d5f106b3158b4bfb98a
size 4907431368
low_noise_pusa.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bdb983ea8cd002821685b8371429e551d68a6c84a5d357a30cec7212e126831b
size 4907431368
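
Optionally, a small verification sketch (not part of the release): it recomputes the SHA-256 of the downloaded checkpoints and compares them against the LFS digests listed above, assuming the `--local-dir` used in the download step.

```python
# Verify the downloaded checkpoints against the LFS sha256 digests listed above.
import hashlib
from pathlib import Path

EXPECTED = {
    "high_noise_pusa.safetensors": "f0219116fb8880de27f6b6a21d8ad9a16916bee7e7570d5f106b3158b4bfb98a",
    "low_noise_pusa.safetensors": "bdb983ea8cd002821685b8371429e551d68a6c84a5d357a30cec7212e126831b",
}

for name, digest in EXPECTED.items():
    path = Path("./model_zoo/PusaV1/Wan2.2-Models") / name  # assumed download dir
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    status = "OK" if h.hexdigest() == digest else "MISMATCH"
    print(f"{name}: {status} ({path.stat().st_size} bytes)")
```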