---
license: mit
library_name: diffusers
pipeline_tag: image-to-video
---

## 🔥🔥🔥 News!!

* Mar 17, 2025: 👋 We release the inference code and model weights of Step-Video-TI2V. [Download](https://huggingface.co/stepfun-ai/stepvideo-ti2v)
* Mar 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2502.10248)

## 🔧 Dependencies and Installation

```bash
git clone https://github.com/stepfun-ai/Step-Video-TI2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo

cd Step-Video-TI2V
pip install -e .
```
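Before launching the services in the next section, it can help to confirm that PyTorch sees enough GPUs. The following check is our own sketch, not part of the repository:

```python
# Our own sanity check, not part of the official repo: verify that PyTorch
# sees enough GPUs for the decoupled setup used in the next section
# (4 DiT workers via torchrun plus one dedicated GPU for the API services).
import torch

assert torch.cuda.is_available(), "CUDA is required for inference"
n_gpus = torch.cuda.device_count()
print(f"Visible GPUs: {n_gpus}")
if n_gpus < 5:
    print("Fewer than 5 GPUs: lower --nproc_per_node/--ulysses_degree to match.")
```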
## 🚀 Inference Scripts

- We employ a decoupling strategy for the text encoder, VAE decoding, and the DiT to optimize the DiT's GPU utilization. As a result, a dedicated GPU is needed to handle the API services for the text encoder's embeddings and VAE decoding (a hypothetical client-side sketch of this split appears at the end of this section).

```bash
python api/call_remote_server.py --model_dir where_you_download_dir &  ## We assume you have more than 4 GPUs available. This command will return the URLs for both the caption API and the VAE API. Please use the returned URLs in the following command.

parallel=4  # or parallel=8
url='127.0.0.1'
model_dir=where_you_download_dir

# The example prompt "男孩笑起来" means "The boy starts to smile."
torchrun --nproc_per_node $parallel run_parallel.py \
    --model_dir $model_dir \
    --vae_url $url \
    --caption_url $url \
    --ulysses_degree $parallel \
    --prompt "男孩笑起来" \
    --first_image_path ./assets/demo.png \
    --infer_steps 50 \
    --save_path ./results \
    --cfg_scale 9.0 \
    --motion_score 5.0 \
    --time_shift 12.573
```

The following table shows the requirements for running the Step-Video-TI2V model (batch size = 1, w/o cfg distillation) to generate videos:

| GPUs | height × width × frames | Peak GPU memory | Time (50 steps) |
|------|-------------------------|-----------------|-----------------|
| 1    | 768px × 768px × 102f    | 76.42 GB        | 1061 s          |
| 1    | 544px × 992px × 102f    | 75.49 GB        | 929 s           |
| 4    | 768px × 768px × 102f    | 64.63 GB        | 288 s           |
| 4    | 544px × 992px × 102f    | 64.34 GB        | 251 s           |
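As a quick read on scaling (our arithmetic derived from the table above, not additional measurements), the per-step latency and the 1 → 4 GPU speedup work out as follows:

```python
# Our arithmetic from the table above: per-step latency and multi-GPU speedup.
runs = {
    # (resolution, number of GPUs): total seconds for 50 steps
    ("768px × 768px × 102f", 1): 1061,
    ("768px × 768px × 102f", 4): 288,
    ("544px × 992px × 102f", 1): 929,
    ("544px × 992px × 102f", 4): 251,
}
for (shape, gpus), total in runs.items():
    print(f"{shape} on {gpus} GPU(s): {total / 50:.1f} s/step")

# 1 -> 4 GPU speedup at each resolution (close to the ideal 4x):
print(f"768×768 speedup: {1061 / 288:.2f}x")  # ~3.68x
print(f"544×992 speedup: {929 / 251:.2f}x")   # ~3.70x
```

Both resolutions scale at roughly 3.7×, i.e. close to linear under the Ulysses sequence parallelism enabled by `--ulysses_degree`.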
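To make the decoupled design described above concrete, here is a minimal, hypothetical client-side sketch. The endpoint paths, port, and payload schema are illustrative assumptions only; the real interface is defined by `api/call_remote_server.py` in the repository:

```python
# HYPOTHETICAL sketch only: endpoint paths, port, and payload format are
# illustrative assumptions, not the actual interface of
# api/call_remote_server.py. It shows the shape of the decoupled design:
# DiT workers offload text encoding and VAE decoding to remote services
# instead of holding those models in their own GPU memory.
import requests

CAPTION_URL = "http://127.0.0.1:8080/caption"  # assumed port and path
VAE_URL = "http://127.0.0.1:8080/vae"          # assumed port and path

def encode_prompt(prompt: str) -> dict:
    """Request prompt embeddings from the remote text-encoder service."""
    resp = requests.post(CAPTION_URL, json={"prompt": prompt})
    resp.raise_for_status()
    return resp.json()  # assumed: serialized embeddings

def decode_latents(latents: list) -> dict:
    """Request frame decoding of denoised latents from the remote VAE service."""
    resp = requests.post(VAE_URL, json={"latents": latents})
    resp.raise_for_status()
    return resp.json()  # assumed: serialized video frames
```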