---
license: apache-2.0
---
Lijie Liu*, Tianxiang Ma*, Bingchuan Li*, Zhuowei Chen*, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, Xinglong Wu
> \* Equal contribution, † Project lead
> Intelligent Creation Team, ByteDance
## 🔥 Latest News!

* Apr 10, 2025: We have updated the full version of the Phantom paper, which now includes more detailed descriptions of the model architecture and the dataset pipeline.
* Apr 20, 2025: 👋 Phantom-Wan is coming! We adapted the Phantom framework to the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model. The inference code and checkpoint have been released.

## 📑 Todo List

- [x] Inference code and checkpoint of Phantom-Wan 1.3B
- [ ] Checkpoint of Phantom-Wan 14B
- [ ] Training code of Phantom-Wan

## 📖 Overview

Phantom is a unified video generation framework for single- and multi-subject references, built on existing text-to-video and image-to-video architectures. It achieves cross-modal alignment on text-image-video triplet data by redesigning the joint text-image injection model, and it emphasizes subject consistency in human generation while enhancing ID-preserving video generation.

## ⚡️ Quickstart

### Installation

Clone the repo:
```sh
git clone https://github.com/Phantom-video/Phantom.git
cd Phantom
```

Install dependencies:
```sh
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```

### Model Download

First, download the original Wan2.1 1.3B model using huggingface-cli:
```sh
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```

Then download the Phantom-Wan-1.3B model (replace `xxx` below with the actual repository id):
```sh
huggingface-cli download xxx --local-dir ./Phantom-Wan-1.3B
```

### Run Subject-to-Video Generation

- Single-GPU inference

```sh
# Prompt (English translation): Warm sunlight spills over the grass. A little girl with twin ponytails,
# a green bow and a light-green dress crouches beside blooming daisies. Next to her, a brown-and-white
# dog pants with its tongue out, its fluffy tail wagging happily. Smiling, the girl raises a yellow-and-red
# toy camera with blue buttons to capture the joyful moment with the dog.
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "暖阳漫过草地,扎着双马尾、头戴绿色蝴蝶结、身穿浅绿色连衣裙的小女孩蹲在盛开的雏菊旁。她身旁一只棕白相间的狗狗吐着舌头,毛茸茸尾巴欢快摇晃。小女孩笑着举起黄红配色、带有蓝色按钮的玩具相机,将和狗狗的欢乐瞬间定格。" --base_seed 42
```

- Multi-GPU inference using FSDP + xDiT USP

```sh
pip install "xfuser>=0.4.1"
# Prompt (English translation): At sunset, a woman with wheat-colored skin and long jet-black hair puts on
# a red gauze dress decorated with large three-dimensional flowers and flowing ribbons at the shoulders,
# and strolls along a golden beach as the sea breeze gently lifts her hair; the scene is beautiful and moving.
torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
```

> 💡Note:
> * Changing `--ref_image` switches between single-reference and multi-reference Subject-to-Video generation; at most 4 reference images are supported (see the single-reference sketch below).
> * For the best results, describe the visual content of the reference images as accurately as possible in `--prompt`. For example, "examples/ref1.png" can be described as "a toy camera in yellow and red with blue buttons".
> * If a generated video is unsatisfactory, the most straightforward fix is to change `--base_seed` and adjust the wording of `--prompt` (a seed-sweep sketch is given at the end of this section).

For inference examples, please refer to `infer.sh`.
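As a minimal sketch of the single-reference case mentioned in the note above, the command below passes one image to `--ref_image`. The English prompt wording is an assumption (it simply restates the suggested description of `examples/ref1.png`); all flags are the same documented ones used in the examples above.

```sh
# Minimal single-reference sketch (illustrative, not an official example):
# one image for --ref_image; the prompt is an assumed English description of examples/ref1.png.
python generate.py --task s2v-1.3B --size 832*480 \
    --ckpt_dir ./Wan2.1-T2V-1.3B \
    --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth \
    --ref_image "examples/ref1.png" \
    --prompt "A toy camera in yellow and red with blue buttons, sitting on green grass in warm sunlight." \
    --base_seed 42
```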
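When a result is unsatisfactory, one simple (unofficial) way to explore alternatives is to rerun the same command with several seeds. The seed values and prompt below are arbitrary placeholders, not values recommended by the authors.

```sh
# Illustrative seed sweep (not part of the repo): rerun the identical single-reference
# command with different --base_seed values; 42, 123 and 2025 are arbitrary choices.
for seed in 42 123 2025; do
  python generate.py --task s2v-1.3B --size 832*480 \
      --ckpt_dir ./Wan2.1-T2V-1.3B \
      --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth \
      --ref_image "examples/ref1.png" \
      --prompt "A toy camera in yellow and red with blue buttons, sitting on green grass in warm sunlight." \
      --base_seed "$seed"
done
```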