RaphaelLiu committed
Commit dd2910c · verified · Parent(s): bbcfa8c

Add files using upload-large-folder tool

README.md CHANGED
@@ -1,3 +1,136 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - RaphaelLiu/PusaV1_training
+ base_model:
+ - Wan-AI/Wan2.2-T2V-A14B
+ tags:
+ - image-to-video
+ - start-end-frames
+ - text-to-video
+ - video-to-video
+ - video-extension
+ ---
+
+ # Pusa Wan2.2 V1.0 Model
+
+ [Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Project Page](https://yaofang-liu.github.io/Pusa_Web/) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV1_training) | [Wan2.1 Model](https://huggingface.co/RaphaelLiu/PusaV1) | [Paper (Pusa V1.0)](https://arxiv.org/abs/2507.16116) | [Paper (FVDM)](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
+
+ ## Overview
+
+ **Pusa Wan2.2 V1.0** extends the groundbreaking Pusa paradigm to the advanced **Wan2.2-T2V-A14B** architecture, featuring a **MoE DiT design** with separate high-noise and low-noise models. This architecture provides enhanced quality control and generation capabilities while maintaining the **vectorized timestep adaptation (VTA)** approach.
+
+ Building upon the success of Pusa V1.0, this Wan2.2 variant leverages the advanced MoE DiT architecture to achieve even better temporal modeling and video quality. The model supports **⚡ LightX2V acceleration** for ultra-fast 4-step inference while maintaining generation quality.
+
+ The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. **Pusa**¹ addresses these challenges through **vectorized timestep adaptation (VTA)**, enabling fine-grained temporal control within a unified video diffusion framework.
+
+ By adapting the SOTA Wan2.2-T2V-A14B model with VTA, we achieve unprecedented efficiency and enhanced quality through the dual DiT architecture. Pusa not only performs image-to-video (I2V) generation but also **unlocks many zero-shot multi-task capabilities such as start-end frames and video extension**, all without task-specific training.
+
+ ¹*Pusa (菩萨, /pu: 'sA:/) normally refers to the "Thousand-Hand Guanyin" in Chinese, whose many hands symbolize boundless compassion and ability. We use this name to indicate that our model uses many timestep variables to achieve numerous video generation capabilities, and we will fully open-source it so the community can benefit from this technology.*
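+
+ As a rough illustration of what VTA means in practice (a conceptual sketch, not the repository's implementation; function names and shapes are invented for clarity): a conventional sampler feeds one scalar timestep to every frame, whereas a vectorized-timestep sampler gives each frame its own timestep, so a conditioning frame such as the input image placed at `--cond_position 0` can stay nearly clean while the remaining frames are denoised from full noise.
+
+ ```python
+ import torch
+
+ def make_frame_timesteps(t_scalar, num_frames, cond_positions, noise_multipliers):
+     """Expand a scalar timestep into a per-frame timestep vector.
+
+     Conditioning frames get their timestep scaled by a noise multiplier
+     (0.0 keeps the frame clean); all other frames share the ordinary
+     scalar timestep. Purely illustrative.
+     """
+     t = torch.full((num_frames,), float(t_scalar))
+     for pos, mult in zip(cond_positions, noise_multipliers):
+         t[pos] = t_scalar * mult
+     return t
+
+ # Conventional video diffusion: every frame evolves in lockstep.
+ t_conventional = torch.full((21,), 999.0)
+
+ # Vectorized timestep adaptation: frame 0 is an almost-clean image
+ # condition (multiplier 0.2); the rest start from full noise.
+ t_vta = make_frame_timesteps(999.0, num_frames=21,
+                              cond_positions=[0], noise_multipliers=[0.2])
+ print(t_vta[:3])  # tensor([199.8000, 999.0000, 999.0000])
+ ```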
+
+ ## 🆕 Wan2.2 Enhancements
+
+ - **MoE DiT Architecture**: Separate high-noise and low-noise DiT models for enhanced quality control (see the sketch below)
+ - **⚡ LightX2V Support**: Ultra-fast 4-step inference with maintained quality
+
+
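+ A loose sketch of what the two-expert design implies for sampling (illustrative only; the boundary value and function names are assumptions, not the repository's API): early, high-noise denoising steps go to the high-noise DiT and later steps to the low-noise DiT, and the two Pusa LoRAs in this repository adapt each expert separately.
+
+ ```python
+ def pick_expert(t_normalized, high_noise_dit, low_noise_dit, boundary=0.9):
+     """Route one denoising step to one of the two DiT experts.
+
+     t_normalized: current timestep scaled to [0, 1], where 1.0 is pure noise.
+     Steps above the assumed boundary use the high-noise expert (adapted by
+     --high_lora_path); the rest use the low-noise expert (--low_lora_path).
+     """
+     return high_noise_dit if t_normalized >= boundary else low_noise_dit
+ ```
+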
+ ## ✨ Key Features
+
+ - **Comprehensive Multi-task Support**:
+   - Image-to-Video
+   - Start-End Frames
+   - Video Completion
+   - Video Extension
+   - Text-to-Video
+   - Video Transition
+   - And more...
+
+
+ ## Installation and Usage
+
+ ### Download Weights and Setup
+
+ **Option 1**: Use the Hugging Face CLI:
+ ```shell
+ # Make sure you are in the PusaV1 directory
+ # Install huggingface-cli if you don't have it
+ pip install -U "huggingface_hub[cli]"
+ huggingface-cli download RaphaelLiu/Pusa-Wan2.2-V1 --local-dir ./model_zoo/PusaV1/Wan2.2-Models
+
+ # Download base Wan2.2 models if you don't have them
+ huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./model_zoo/PusaV1/Wan2.2-T2V-A14B
+ ```
+
+ **Option 2**: Download the LoRA checkpoints directly from [this Hugging Face repository](https://huggingface.co/RaphaelLiu/Pusa-Wan2.2-V1).
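+
+ The same downloads can also be scripted from Python with `huggingface_hub` (a minimal sketch mirroring the CLI commands in Option 1; adjust the local directories to your layout):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Pusa Wan2.2 LoRA checkpoints (this repository)
+ snapshot_download(repo_id="RaphaelLiu/Pusa-Wan2.2-V1",
+                   local_dir="./model_zoo/PusaV1/Wan2.2-Models")
+
+ # Base Wan2.2 T2V model (skip if you already have it)
+ snapshot_download(repo_id="Wan-AI/Wan2.2-T2V-A14B",
+                   local_dir="./model_zoo/PusaV1/Wan2.2-T2V-A14B")
+ ```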
+
+ ### Usage Examples
+
+ Use with the [Pusa codebase](https://github.com/Yaofang-Liu/Pusa-VidGen).
+
+ **Standard Inference (10 or more steps, cfg_scale=3.0):**
+ ```shell
+ python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
+   --image_paths "./demos/input_image.jpg" \
+   --prompt "Your prompt here" \
+   --cond_position "0" \
+   --noise_multipliers "0.2" \
+   --high_lora_path "./path/to/high_noise_lora.pt" \
+   --high_lora_alpha 1.5 \
+   --low_lora_path "./path/to/low_noise_lora.pt" \
+   --low_lora_alpha 1.4 \
+   --num_inference_steps 30 \
+   --cfg_scale 3.0
+ ```
+
+ **LightX2V Acceleration (4 steps, cfg_scale=1.0):**
+ ```shell
+ python examples/pusavideo/wan22_14b_multi_frames_pusa.py \
+   --image_paths "./demos/input_image.jpg" \
+   --prompt "Your prompt here" \
+   --cond_position "0" \
+   --noise_multipliers "0" \
+   --high_lora_path "./path/to/high_noise_lora.pt" \
+   --high_lora_alpha 1.5 \
+   --low_lora_path "./path/to/low_noise_lora.pt" \
+   --low_lora_alpha 1.4 \
+   --num_inference_steps 4 \
+   --cfg_scale 1 \
+   --lightx2v
+ ```
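+
+ The other zero-shot tasks use the same script; what changes is which frames are passed as conditions and how much noise they keep. The values below are purely illustrative presets (the frame indices and multipliers are assumptions, not prescribed settings; see the [Pusa codebase](https://github.com/Yaofang-Liu/Pusa-VidGen) for working configurations):
+
+ ```python
+ # Hypothetical task presets: map each task to conditioning positions and
+ # per-frame noise multipliers (0 = keep the conditioning frame clean).
+ TASK_PRESETS = {
+     "image-to-video":   {"cond_position": "0",       "noise_multipliers": "0.2"},
+     "start-end-frames": {"cond_position": "0,20",    "noise_multipliers": "0.2,0.4"},
+     "video-extension":  {"cond_position": "0,1,2,3", "noise_multipliers": "0,0,0,0"},
+ }
+ ```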
+
+ ### Key Parameters for Wan2.2
+
+ - **`--high_lora_path`**: Path to the high-noise DiT LoRA checkpoint
+ - **`--low_lora_path`**: Path to the low-noise DiT LoRA checkpoint
+ - **`--high_lora_alpha`**: LoRA alpha for the high-noise model (recommended: 1.5)
+ - **`--low_lora_alpha`**: LoRA alpha for the low-noise model (recommended: 1.4)
+ - **`--lightx2v`**: Enable LightX2V acceleration
+ - **`--cfg_scale`**: Use 1.0 with LightX2V, 3.0 for standard inference (see the sketch below)
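+
+ To make these knobs concrete, here is a generic sketch (not code from the Pusa repository; names are illustrative) of what a LoRA alpha and a classifier-free-guidance scale typically control. With `cfg_scale=1.0` the guidance term vanishes, which is why the LightX2V fast path can run without real guidance:
+
+ ```python
+ def merge_lora(weight, lora_down, lora_up, alpha):
+     # Generic LoRA merge: W' = W + alpha * (up @ down).
+     # A larger alpha strengthens the adapter's effect on the base weight
+     # (e.g. 1.5 for the high-noise expert, 1.4 for the low-noise one).
+     return weight + alpha * (lora_up @ lora_down)
+
+ def cfg_combine(noise_uncond, noise_cond, cfg_scale):
+     # Classifier-free guidance: push the prediction away from the
+     # unconditional output, toward the text-conditioned one.
+     # cfg_scale == 1.0 reduces to the conditional prediction alone.
+     return noise_uncond + cfg_scale * (noise_cond - noise_uncond)
+ ```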
+
+
+ ## Related Work
+
+ - [FVDM](https://arxiv.org/abs/2410.03160): Introduces the frame-level noise control with vectorized timesteps that inspired Pusa
+ - [Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B): The advanced dual-DiT base model for this version
+ - [LightX2V](https://github.com/ModelTC/LightX2V): Acceleration technique for fast inference
+ - [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): Optimized LoRA implementation for efficient training
+
+ ## Citation
+
+ If you find our work useful in your research, please consider citing:
+
+ ```bibtex
+ @article{liu2025pusa,
+   title={PUSA V1.0: Surpassing Wan-I2V with \$500 Training Cost by Vectorized Timestep Adaptation},
+   author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
+   journal={arXiv preprint arXiv:2507.16116},
+   year={2025}
+ }
+ ```
+
+ ```bibtex
+ @article{liu2024redefining,
+   title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
+   author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-Michel},
+   journal={arXiv preprint arXiv:2410.03160},
+   year={2024}
+ }
+ ```
high_noise_pusa.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f0219116fb8880de27f6b6a21d8ad9a16916bee7e7570d5f106b3158b4bfb98a
+ size 4907431368
low_noise_pusa.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bdb983ea8cd002821685b8371429e551d68a6c84a5d357a30cec7212e126831b
+ size 4907431368