SANA-Video Model Card
SANA-Video is a small, ultra-efficient diffusion model designed for rapid generation of high-quality, minute-long videos at resolutions up to 720×1280.
Key innovations and efficiency drivers include:
(1) Linear DiT: Leverages linear attention as the core operation, offering significantly more efficiency than vanilla attention when processing the massive number of tokens required for video generation.
(2) Constant-Memory KV Cache for Block Linear Attention: Implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV cache bottleneck and enabling efficient, minute-long video synthesis.
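To make (2) concrete, the sketch below shows how linear attention lets past keys and values be folded into two fixed-size running sums, so each new block of video tokens can attend to the full history without a growing KV cache. It is a minimal, simplified illustration; the feature map, normalization, and block layout here are assumptions for exposition, not SANA-Video's actual implementation.

import torch

def block_linear_attention(q_blocks, k_blocks, v_blocks, eps=1e-6):
    # Each element of *_blocks is a (block_len, d) tensor of queries/keys/values
    # for one chunk of video tokens, processed in generation order.
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map (illustrative choice)
    d = q_blocks[0].shape[-1]
    S = torch.zeros(d, d)   # running sum of phi(k)^T v over all blocks seen so far
    z = torch.zeros(d)      # running sum of phi(k), used for normalization
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        qf, kf = phi(q), phi(k)
        S = S + kf.T @ v          # fold this block's keys/values into the constant-size state
        z = z + kf.sum(dim=0)
        # Queries attend to every token up to and including the current block.
        out = (qf @ S) / (qf @ z).clamp_min(eps).unsqueeze(-1)
        outputs.append(out)
    return torch.cat(outputs, dim=0)  # context memory is O(d^2), independent of video length

Because only the fixed-size state (S, z) is carried between blocks, the memory needed for global context stays constant no matter how many seconds of video have been generated.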
SANA-Video achieves exceptional efficiency and cost savings: its training cost is only 1% of MovieGen's (12 days on 64 H100 GPUs). Compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1 and SkyReel-V2), SANA-Video maintains competitive performance while being 16× faster in measured latency. SANA-Video is deployable on RTX 5090 GPUs, accelerating the inference speed for a 5-second 720p video from 71s down to 29s (2.4× speedup), setting a new standard for low-cost, high-quality video generation.
Source code is available at https://github.com/NVlabs/Sana.
How to Run Inference
import torch
from diffusers import SanaVideoPipeline
from diffusers import DPMSolverMultistepScheduler  # only needed if you swap the scheduler below
from diffusers.utils import export_to_video

model_id = "Efficient-Large-Model/SANA-Video_2B_480p_diffusers"
pipe = SanaVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# Optional: switch to a multistep solver, e.g.
# pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=8.0)
pipe.vae.to(torch.float32)            # run the VAE in float32
pipe.text_encoder.to(torch.bfloat16)  # keep the text encoder in bfloat16
pipe.to("cuda")
motion_score = 30  # motion score appended to the text prompt below
prompt = "Evening, backlight, side lighting, soft light, high contrast, mid-shot, centered composition, clean solo shot, warm color. A young Caucasian man stands in a forest, golden light glimmers on his hair as sunlight filters through the leaves. He wears a light shirt, wind gently blowing his hair and collar, light dances across his face with his movements. The background is blurred, with dappled light and soft tree shadows in the distance. The camera focuses on his lifted gaze, clear and emotional."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."
motion_prompt = f" motion score: {motion_score}."
prompt = prompt + motion_prompt
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    frames=81,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "sana_video.mp4", fps=16)
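If the full pipeline does not fit in GPU memory, diffusers' generic model CPU offloading can be tried in place of moving everything to CUDA; this is a sketch assuming the pipeline supports the standard offload hook:

# Replace pipe.to("cuda") with model-level CPU offloading of submodules.
pipe.enable_model_cpu_offload()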
Model Description
- Developed by: NVIDIA, Sana
- Model type: Efficient Video Generation with Block Linear Diffusion Transformer
- Model size: 2B parameters
- Model precision: torch.bfloat16 (BF16)
- Model resolution: This model is developed to generate 480p videos of 81 frames (about 5 seconds at 16 fps) with multi-scale height and width.
- Model Description: This is a model that can be used to generate and modify videos based on text prompts. It is a Linear Diffusion Transformer that uses an 8× Wan-VAE or a 32× spatially compressed latent feature encoder (DC-AE-V) as its latent encoder (see the token-count sketch after this list).
- Resources for more information: Check out our GitHub Repository and the SANA-Video report on arXiv.
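As a rough sense of scale (and of why linear attention matters here), the snippet below estimates the latent token count for this checkpoint's default 480×832, 81-frame setting, assuming only the 8× spatial compression mentioned above; any temporal compression or patchification, which this card does not specify, would reduce the count further.

height, width, num_frames = 480, 832, 81
spatial_compression = 8                        # Wan-VAE spatial factor from the description above
latent_h = height // spatial_compression       # 60
latent_w = width // spatial_compression        # 104
tokens_per_frame = latent_h * latent_w         # 6,240
total_tokens = tokens_per_frame * num_frames   # 505,440 before any temporal compression/patchify
print(tokens_per_frame, total_tokens)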
Model Sources
For research purposes, we recommend our GitHub repository (https://github.com/NVlabs/Sana), which is suitable for both training and inference.
- Repository: https://github.com/NVlabs/Sana
- Guidance: https://github.com/NVlabs/Sana/asset/docs/sana_video.md
License/Terms of Use
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement.
Uses
Direct Use
The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
Excluded uses are described below.
Out-of-Scope Use
The model was not trained to produce factual or true representations of people or events; using it to generate such content is therefore out of scope for this model's abilities.
Limitations and Bias
Limitations
- The model does not achieve perfect photorealism
- The model cannot render complex legible text
- Fingers, etc. in general may not be generated properly.
- The autoencoding part of the model is lossy.
Bias
While the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases.