Image-to-Video
phantom
TianxiangMa committed
Commit 6805a5a · 1 Parent(s): 4d582e4

update README

README.md CHANGED
@@ -24,12 +24,18 @@ library_name: phantom
  <p>

  ## 🔥 Latest News!
- * Apr 10, 2025: We have updated the full version of the Phantom paper, which now includes more detailed descriptions of the model architecture and dataset pipeline.
- * Apr 20, 2025: 👋 Phantom-Wan is coming! We adapted the Phantom framework into the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model. The inference codes and checkpoint have been released.

  ## 📑 Todo List
- - [x] Inference codes and Checkpoint of Phantom-Wan 1.3B
- - [ ] Checkpoint of Phantom-Wan 14B
  - [ ] Training codes of Phantom-Wan

  ## 📖 Overview
@@ -51,29 +57,38 @@ pip install -r requirements.txt
  ```

  ### Model Download
- First you need to download the 1.3B original model of Wan2.1. Download Wan2.1-1.3B using huggingface-cli:
  ``` sh
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
  ```
- Then download the Phantom-Wan-1.3B model:
  ``` sh
- huggingface-cli download xxx --local-dir ./Phantom-Wan-1.3B
  ```

  ### Run Subject-to-Video Generation

  - Single-GPU inference

  ``` sh
- python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "暖阳漫过草地,扎着双马尾、头戴绿色蝴蝶结、身穿浅绿色连衣裙的小女孩蹲在盛开的雏菊旁。她身旁一只棕白相间的狗狗吐着舌头,毛茸茸尾巴欢快摇晃。小女孩笑着举起黄红配色、带有蓝色按钮的玩具相机,将和狗狗的欢乐瞬间定格。" --base_seed 42
  ```

  - Multi-GPU inference using FSDP + xDiT USP

  ``` sh
  pip install "xfuser>=0.4.1"
- torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
  ```

  > 💡Note:
@@ -83,24 +98,129 @@ torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_di

  For inference examples, please refer to "infer.sh". You will get the following generated results:

- <table>
  <tr>
- <td><img src="./assets/result1.gif" alt="GIF 1" width="400"></td>
- <td><img src="./assets/result2.gif" alt="GIF 2" width="400"></td>
  </tr>
  <tr>
- <td><img src="./assets/result3.gif" alt="GIF 3" width="400"></td>
- <td><img src="./assets/result4.gif" alt="GIF 4" width="400"></td>
  </tr>
  </table>

- ## 🆚 Comparative Results
- - **Identity Preserving Video Generation**.
- ![image](./assets/id_eval.png)
- - **Single Reference Subject-to-Video Generation**.
- ![image](./assets/ip_eval_s.png)
- - **Multi-Reference Subject-to-Video Generation**.
- ![image](./assets/ip_eval_m_00.png)

  ## Acknowledgements
  We would like to express our gratitude to the SEED team for their support. Special thanks to Lu Jiang, Haoyuan Guo, Zhibei Ma, and Sen Wang for their assistance with the model and data. In addition, we are also very grateful to Siying Chen, Qingyang Li, and Wei Han for their help with the evaluation.
 
  <p>

  ## 🔥 Latest News!
+ * May 27, 2025: 🎉 We have released the Phantom-Wan-14B model, a more powerful Subject-to-Video generation model.
+ * Apr 23, 2025: 😊 Thanks to [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper/tree/dev) for adapting ComfyUI to Phantom-Wan-1.3B. Everyone is welcome to use it!
+ * Apr 21, 2025: 👋 Phantom-Wan is coming! We adapted the Phantom framework into the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model. The inference codes and checkpoint have been released.
+ * Apr 10, 2025: We have updated the [full version](https://arxiv.org/pdf/2502.11079v2) of the Phantom paper, which now includes more detailed descriptions of the model architecture and dataset pipeline.
+ * Feb 16, 2025: We proposed a novel subject-consistent video generation model, **Phantom**, and have released the [report](https://arxiv.org/pdf/2502.11079v1) publicly. For more video demos, please visit the [project page](https://phantom-video.github.io/Phantom/).
+

  ## 📑 Todo List
+ - [x] Inference codes and Checkpoint of Phantom-Wan-1.3B
+ - [x] Checkpoint of Phantom-Wan-14B
+ - [ ] Checkpoint of Phantom-Wan-14B Pro
+ - [ ] Open source Phantom-Data
  - [ ] Training codes of Phantom-Wan

  ## 📖 Overview
 
  ```

  ### Model Download
+ | Models | Download Link | Notes |
+ |------------------|-----------------------------------------------------------------------------------------------------|------------------------------|
+ | Phantom-Wan-1.3B | 🤗 [Huggingface](https://huggingface.co/bytedance-research/Phantom/blob/main/Phantom-Wan-1.3B.pth) | Supports both 480P and 720P |
+ | Phantom-Wan-14B | 🤗 [Huggingface](https://huggingface.co/bytedance-research/Phantom/tree/main) | Supports both 480P and 720P |
+
+ First, download the original Wan2.1 1.3B model, since our Phantom-Wan models rely on the Wan2.1 VAE and Text Encoder. Download Wan2.1-1.3B using huggingface-cli:
  ``` sh
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
  ```
+
+ Then download the Phantom-Wan-1.3B and Phantom-Wan-14B models:
  ``` sh
+ huggingface-cli download bytedance-research/Phantom --local-dir ./Phantom-Wan-Models
  ```
+ Alternatively, you can manually download the required models and place them in the `Phantom-Wan-Models` folder.
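If you only need the 1.3B checkpoint, the same CLI can fetch a single file instead of the whole repository. The sketch below is one possible way to do that, assuming the `Phantom-Wan-1.3B.pth` filename from the table above and the directory layout used by the inference commands that follow; treat it as an illustrative variant rather than an official instruction from the authors.

``` sh
# Hypothetical selective download (untested): grab only the 1.3B checkpoint file.
huggingface-cli download bytedance-research/Phantom Phantom-Wan-1.3B.pth --local-dir ./Phantom-Wan-Models

# Quick sanity check of the layout assumed by the generate.py commands below.
ls ./Wan2.1-T2V-1.3B ./Phantom-Wan-Models
```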
 
  ### Run Subject-to-Video Generation

+ #### Phantom-Wan-1.3B
+
  - Single-GPU inference

  ``` sh
+ python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "暖阳漫过草地,扎着双马尾、头戴绿色蝴蝶结、身穿浅绿色连衣裙的小女孩蹲在盛开的雏菊旁。她身旁一只棕白相间的狗狗吐着舌头,毛茸茸尾巴欢快摇晃。小女孩笑着举起黄红配色、带有蓝色按钮的玩具相机,将和狗狗的欢乐瞬间定格。" --base_seed 42
  ```

  - Multi-GPU inference using FSDP + xDiT USP

  ``` sh
  pip install "xfuser>=0.4.1"
+ torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
  ```
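In both multi-GPU examples the product of `--ulysses_size` and `--ring_size` equals `--nproc_per_node`, which suggests the parallel degrees can be scaled down for smaller machines. The following is an untested sketch of a 4-GPU variant under that assumption; flag semantics are unchanged and the prompt is a placeholder, not an example from the repository.

``` sh
# Hypothetical 4-GPU run (untested): keep ulysses_size * ring_size equal to nproc_per_node.
torchrun --nproc_per_node=4 generate.py --task s2v-1.3B --size 832*480 \
  --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth \
  --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp \
  --ulysses_size 2 --ring_size 2 --prompt "your text prompt" --base_seed 42
```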
 
  > 💡Note:
 
  For inference examples, please refer to "infer.sh". You will get the following generated results:

+ <table style="width: 100%; border-collapse: collapse; text-align: center; border: 1px solid #ccc;">
+ <tr>
+ <th style="text-align: center;">
+ <strong>Reference Images</strong>
+ </th>
+ <th style="text-align: center;">
+ <strong>Generated Videos (480P)</strong>
+ </th>
+ </tr>
+
+ <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref1.png" alt="Image 1" style="height: 180px;">
+ <img src="assets/ref2.png" alt="Image 2" style="height: 180px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result1.gif" alt="GIF 1" style="width: 400px;">
+ </td>
+ </tr>
+
+ <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref3.png" alt="Image 3" style="height: 180px;">
+ <img src="assets/ref4.png" alt="Image 4" style="height: 180px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result2.gif" alt="GIF 2" style="width: 400px;">
+ </td>
+ </tr>
+
  <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref5.png" alt="Image 5" style="height: 180px;">
+ <img src="assets/ref6.png" alt="Image 6" style="height: 180px;">
+ <img src="assets/ref7.png" alt="Image 7" style="height: 180px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result3.gif" alt="GIF 3" style="width: 400px;">
+ </td>
  </tr>
+
  <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref8.png" alt="Image 8" style="height: 100px;">
+ <img src="assets/ref9.png" alt="Image 9" style="height: 100px;">
+ <img src="assets/ref10.png" alt="Image 10" style="height: 100px;">
+ <img src="assets/ref11.png" alt="Image 11" style="height: 100px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result4.gif" alt="GIF 4" style="width: 400px;">
+ </td>
  </tr>
  </table>
 
+
+ #### Phantom-Wan-14B
+
+ - Single-GPU inference
+
+ ``` sh
+ python generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref12.png,examples/ref13.png" --prompt "扎着双丸子头,身着红黑配色并带有火焰纹饰服饰,颈戴金项圈、臂缠金护腕的哪吒,和有着一头淡蓝色头发,额间有蓝色印记,身着一袭白色长袍的敖丙,并肩坐在教室的座位上,他们专注地讨论着书本内容。背景为柔和的灯光和窗外微风拂过的树叶,营造出安静又充满活力的学习氛围。"
+ ```
+
+ - Multi-GPU inference using FSDP + xDiT USP
+
+ ``` sh
+ pip install "xfuser>=0.4.1"
+ torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref14.png,examples/ref15.png,examples/ref16.png" --dit_fsdp --t5_fsdp --ulysses_size 8 --ring_size 1 --prompt "一位戴着黄色帽子、身穿黄色上衣配棕色背带的卡通老爷爷,在装饰有粉色和蓝色桌椅、悬挂着彩色吊灯且摆满彩色圆球装饰的清新卡通风格咖啡馆里,端起一只蓝色且冒着热气的咖啡杯,画面风格卡通、清新。"
+ ```
+
+ > 💡Note:
+ > * The currently released Phantom-Wan-14B model was trained on 480P data but can also generate videos at 720P and higher resolutions, though the results may be less stable. We plan to release a version further trained on 720P data in the future.
+ > * The Phantom-Wan-14B model was trained on 24fps data, but it can also generate 16fps videos, similar to the native Wan2.1; however, quality may decline slightly.
+
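Following the note above, generating at 720P or at 16fps should only require changing `--size` or `--sample_fps`. The command below is an untested sketch under that assumption; the `1280*720` value mirrors the width*height pattern of the 480P examples and is not taken from the authors' instructions.

``` sh
# Hypothetical 720P run of Phantom-Wan-14B (untested); per the note above, expect less stable results than at 480P.
python generate.py --task s2v-14B --size 1280*720 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref12.png,examples/ref13.png" --prompt "your text prompt"

# For 16fps output, swapping in --sample_fps 16 should also work per the second note, with a possible slight quality drop.
```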
+ For more inference examples, please refer to "infer.sh". You will get the following generated results:
+
+ <table style="width: 100%; border-collapse: collapse; text-align: center; border: 1px solid #ccc;">
+ <tr>
+ <th style="text-align: center;">
+ <strong>Reference Images</strong>
+ </th>
+ <th style="text-align: center;">
+ <strong>Generated Videos (720P)</strong>
+ </th>
+ </tr>
+
+ <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref12.png" alt="Image 1" style="height: 180px;">
+ <img src="assets/ref13.png" alt="Image 2" style="height: 180px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result5.gif" alt="GIF 1" style="width: 400px;">
+ </td>
+ </tr>
+
+ <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref17.png" alt="Image 3" style="height: 150px;">
+ <img src="assets/ref18.png" alt="Image 4" style="height: 150px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result7.gif" alt="GIF 2" style="width: 400px;">
+ </td>
+ </tr>
+
+ <tr>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/ref14.png" alt="Image 5" style="height: 120px;">
+ <img src="assets/ref15.png" alt="Image 6" style="height: 120px;">
+ <img src="assets/ref16.png" alt="Image 7" style="height: 120px;">
+ </td>
+ <td style="text-align: center; vertical-align: middle;">
+ <img src="assets/result6.gif" alt="GIF 3" style="width: 400px;">
+ </td>
+ </tr>
+
+ </table>
+
+ > The GIF videos are compressed.
+
 
  ## Acknowledgements
  We would like to express our gratitude to the SEED team for their support. Special thanks to Lu Jiang, Haoyuan Guo, Zhibei Ma, and Sen Wang for their assistance with the model and data. In addition, we are also very grateful to Siying Chen, Qingyang Li, and Wei Han for their help with the evaluation.
assets/ref1.png ADDED

Git LFS Details

  • SHA256: 28586e41daf7f45c5e6b8e215cc8c55be08f32dae0f7b5b38540c95952d668c7
  • Pointer size: 131 Bytes
  • Size of remote file: 321 kB
assets/ref10.png ADDED

Git LFS Details

  • SHA256: 19e6e4f071acc3110a9559bae3a6d6379eea6af8b44482608dbcc674bafb2c14
  • Pointer size: 131 Bytes
  • Size of remote file: 350 kB
assets/ref11.png ADDED

Git LFS Details

  • SHA256: 7d5cc55c6360555c9bacffd0d3e3a8fd064f966919dd543448643e6dbcc883c8
  • Pointer size: 132 Bytes
  • Size of remote file: 1.36 MB
assets/ref12.png ADDED

Git LFS Details

  • SHA256: b4e3360f2b931b1082b71a1afcaa4f92d76f08e132b043509e1fde4057ab79f2
  • Pointer size: 131 Bytes
  • Size of remote file: 887 kB
assets/ref13.png ADDED

Git LFS Details

  • SHA256: c23af492984e7ce149f63782f4e378cb8eed92ad0726e3b5527507dce1d67534
  • Pointer size: 131 Bytes
  • Size of remote file: 623 kB
assets/ref14.png ADDED

Git LFS Details

  • SHA256: 6cdc2b9f71f49e08f711ee723bc19e7e75f8970affea21ace36f297781269166
  • Pointer size: 131 Bytes
  • Size of remote file: 525 kB
assets/ref15.png ADDED

Git LFS Details

  • SHA256: bf6b9552a942092d87371c123437761db6520a2f8465e47f2de772f66762c018
  • Pointer size: 131 Bytes
  • Size of remote file: 374 kB
assets/ref16.png ADDED

Git LFS Details

  • SHA256: 23e53b75b1b5f52a18cf72e69d5b566dc30b85aa1215616d4ab70d51b4fe8147
  • Pointer size: 131 Bytes
  • Size of remote file: 801 kB
assets/ref17.png ADDED

Git LFS Details

  • SHA256: 2e65bb33b6bf46d7bf0f430491c55b2cacb9fb318d8f8b8d4ef512872fe8111d
  • Pointer size: 132 Bytes
  • Size of remote file: 2.72 MB
assets/ref18.png ADDED

Git LFS Details

  • SHA256: 6299141e6ff65819b5028be1045522102b2418f8467602de88691b4b96df6a0b
  • Pointer size: 132 Bytes
  • Size of remote file: 3.43 MB
assets/ref2.png ADDED

Git LFS Details

  • SHA256: d152b9f0bb14e404a18a6a4fdfd67e1cb7f504e4f3be48aea095f4ca49e499d1
  • Pointer size: 131 Bytes
  • Size of remote file: 567 kB
assets/ref3.png ADDED

Git LFS Details

  • SHA256: d0e1fb55f84f858ba929b2b55379963244dd2202790e576a28c5543e97148323
  • Pointer size: 131 Bytes
  • Size of remote file: 785 kB
assets/ref4.png ADDED

Git LFS Details

  • SHA256: 1d358f1e10433ad414ffc3f84fc74f48bd42ef36d9ceb798eff80a1022823eee
  • Pointer size: 132 Bytes
  • Size of remote file: 1.71 MB
assets/ref5.png ADDED

Git LFS Details

  • SHA256: 3902f26beae4fc1209d072348330f67bb9639a1a82f32cbbd753d0ca4ae6755f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.43 MB
assets/ref6.png ADDED

Git LFS Details

  • SHA256: 5b0afb49db6848b4c60a2dc9e1a45d8b6d0560d9cac793709aa9a9187b18b8c6
  • Pointer size: 131 Bytes
  • Size of remote file: 868 kB
assets/ref7.png ADDED

Git LFS Details

  • SHA256: c2f531cbf3eee3eb2322a8d9b65f42244bfe4f033cc0ecdcc95286b61cc7b75a
  • Pointer size: 131 Bytes
  • Size of remote file: 665 kB
assets/ref8.png ADDED

Git LFS Details

  • SHA256: 4d1b1f9a6a50cea5b81802533345471ff19b348619c57f4537da97f8bed863a5
  • Pointer size: 131 Bytes
  • Size of remote file: 729 kB
assets/ref9.png ADDED

Git LFS Details

  • SHA256: e010141d01bc4b832426eb701a4af4672e377a9aaa67204313838a5ae10a1c3d
  • Pointer size: 131 Bytes
  • Size of remote file: 420 kB
assets/result5.gif ADDED

Git LFS Details

  • SHA256: 86c5bf1e896b064c85643aeaf1041c3efa6f763a7cf5ae77983a54054d5cd0c2
  • Pointer size: 132 Bytes
  • Size of remote file: 9.9 MB
assets/result6.gif ADDED

Git LFS Details

  • SHA256: 33070f0ec3d187ae67162750083338c898ffb6d8a5766f29b8bf20070a12c06c
  • Pointer size: 132 Bytes
  • Size of remote file: 8.08 MB
assets/result7.gif ADDED

Git LFS Details

  • SHA256: 281cc99d1a64247315e2310a9a44b7e5418e8dc7a662545b3ecb579043bbdd9c
  • Pointer size: 133 Bytes
  • Size of remote file: 27.4 MB