How Much Power Does a SOTA Open Video Model Use? ⚡🔥
TL;DR We ran energy benchmarks for some of the latest open-source text-to-video models, including Mochi-1-preview, CogVideoX-5b, WAN2.1-T2V-1.3B-Diffusers, AnimateDiff, and others, and measured how much energy it takes to produce a short video.
The results were striking: generating a single clip can cost anywhere from about a tenth of a watt-hour to over 100 Wh depending on the model, a nearly 800× difference in energy use.
Fig: Video generated with Mochi-1-preview | Prompt: Close-up of a chameleon's eye, its scaly skin changing color (4K)
Why This Benchmark?
Recent breakthroughs like Sora by OpenAI and Veo 3 by Google DeepMind have flooded social feeds with jaw-dropping AI-generated videos β setting the bar higher than ever for text-to-video generation. Meanwhile, the open-source community is rapidly catching up, releasing powerful models that anyone can run on a decent GPU. But flashy demos don't reveal the hidden costs: how much energy and compute time does it really take to make just a few seconds of footage? We wanted to find out by comparing several popular open models under the same conditions and sharing reproducible numbers, so you know exactly what to expect when generating your next clip.
Fig: Video generated with CogVideoX-5b | Prompt: "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes...
βοΈ Experimental Setup
Hardware
- CPU: AMD EPYC 7R13 (8 cores)
- GPU: 1× NVIDIA H100 80GB HBM3
Methodology
- 2 warmup runs to stabilise performance.
- 5 measured runs per model.
- Energy usage tracked with CodeCarbon.
- Each model was run with the recommended parameters from its official Hugging Face page. This means resolution, number of frames, FPS and sampling steps may differ across models, reflecting realistic usage instead of enforcing identical settings that might hurt quality. A minimal sketch of the measurement loop is shown below the list.
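Here is that sketch: one benchmarked run with a generic diffusers pipeline and CodeCarbon's EmissionsTracker. The model ID, step count, and frame count are illustrative; the actual per-model scripts live in the repo linked at the end.

```python
import torch
from diffusers import DiffusionPipeline
from codecarbon import EmissionsTracker

MODEL_ID = "THUDM/CogVideoX-2b"   # any model from the table below
PROMPT = "A majestic dragon flying over snowy mountains."

pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to("cuda")

# 2 warmup runs (not measured) to stabilise clocks and caches
for _ in range(2):
    pipe(prompt=PROMPT, num_inference_steps=50, num_frames=49)

# 5 measured runs, one tracker session per generation
for run in range(5):
    tracker = EmissionsTracker(project_name=f"cogvideox-2b-run{run}", output_dir="results")
    tracker.start()
    pipe(prompt=PROMPT, num_inference_steps=50, num_frames=49)
    tracker.stop()  # appends GPU/CPU/RAM energy (in kWh) to results/emissions.csv
```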
Key Parameters (per model)
Model | Steps | Resolution (H×W) | Frames | FPS | HF Page |
---|---|---|---|---|---|
AnimateDiff | 4 | 512×512 | 16 | 10 | AnimateDiff |
CogVideoX-2b | 50 | 480×720 | 49 | 8 | CogVideoX-2b |
CogVideoX-5b | 50 | 480×720 | 49 | 8 | CogVideoX-5b |
LTX-Video-0.9.7-dev | 30 | 512×704 | 121 | 24 | LTX-Video |
Mochi-1-preview | 64 | 480×848 | 84 | 30 | Mochi-1 |
WAN2.1-T2V-1.3B | 60 | 480×832 | 81 | 15 | WAN2.1-T2V-1.3B |
WAN2.1-T2V-14B | 60 | 480×832 | 81 | 15 | WAN2.1-T2V-14B |
What these parameters mean (see the example call after the list):
- Steps: number of denoising or sampling steps; more steps often mean better detail but more time and energy.
- Resolution (H×W): video frame size in pixels; higher resolution increases GPU load.
- Frames: total frames generated per clip; more frames = a longer video and more compute.
- FPS (Frames Per Second): playback smoothness; higher FPS needs more unique frames per second of video.
- HF Page: link to the official Hugging Face model card with instructions and recommendations.
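For example, here is how the CogVideoX-5b row translates into a generation call, assuming the standard diffusers CogVideoXPipeline interface; note that FPS only matters when exporting frames to a video file.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A majestic dragon flying over snowy mountains.",
    num_inference_steps=50,   # Steps
    num_frames=49,            # Frames
    height=480,               # Resolution H
    width=720,                # Resolution W
).frames[0]

export_to_video(frames, "dragon.mp4", fps=8)  # FPS only affects playback speed
```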
Sample Prompts: To test a range of scenes, we used various prompts, from cinematic cityscapes to wildlife close-ups and playful mascot vlogs. Examples include:
"A futuristic cityscape at night, neon lights reflecting on wet streets."
"A majestic dragon flying over snowy mountains."
"A realistic gorilla wearing a yellow Hugging Face t-shirt, filming itself in selfie mode while walking around Paris landmarks like the Eiffel Tower."
"Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4K."
Fig: Video generated with WAN2.1-T2V-14B | Prompt: A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon...
🎬 First Look: How Do the Videos Look?
Following the Veo 3 trend of gorilla vlogging, we tried it ourselves with the prompt: "A realistic gorilla wearing a yellow Hugging Face t-shirt, filming itself in selfie mode while walking around Paris landmarks like the Eiffel Tower."
Quick impressions:
- AnimateDiff produces a very short clip but struggles to keep the gorilla's appearance.
- CogVideoX keeps a consistent ape appearance, but motion sometimes glitches around the head.
- Mochi is my favourite: smooth motion, very coherent, though colors look slightly washed out.
- LTX-Video shows fluid motion but can feel floaty or ghost-like.
- WAN2.1-T2V models render a polished, image-like style that is not very realistic for live action, but they display the Hugging Face text on the t-shirt impressively well.
Energy Use: The Numbers
There are huge differences in energy use between models, sometimes by orders of magnitude. For example, AnimateDiff uses barely 0.11 Wh of GPU energy to produce a short clip, while WAN2.1-T2V-14B burns almost 94 Wh on the GPU alone for a single video.
We averaged the GPU + CPU + RAM energy per generated clip, amortising the warmup cost (in the chart, black error bars indicate the standard deviation across the 10 prompt generations). A minimal aggregation sketch is shown after the table:
Model | Avg. GPU Energy (Wh) | Avg. CPU Energy (Wh) | Avg. RAM Energy (Wh) |
---|---|---|---|
AnimateDiff | 0.11 | 0.02 | 0.01 |
LTX-Video-0.9.7-dev | 3.19 | 0.41 | 0.19 |
CogVideoX-2b | 8.32 | 1.21 | 0.55 |
CogVideoX-5b | 21.71 | 2.91 | 1.31 |
WAN2.1-T2V-1.3B | 19.73 | 1.98 | 1.09 |
WAN2.1-T2V-14B | 93.83 | 10.47 | 5.17 |
Mochi-1-preview | 46.77 | 6.40 | 2.89 |
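As promised above, here is the aggregation step, assuming each model writes its measured runs to a results/emissions.csv with CodeCarbon's standard gpu_energy / cpu_energy / ram_energy columns (reported in kWh):

```python
import pandas as pd

df = pd.read_csv("results/emissions.csv")  # one row per measured run
energy_wh = df[["gpu_energy", "cpu_energy", "ram_energy"]] * 1000  # kWh -> Wh

print(energy_wh.mean())  # average energy per clip (values reported in the table)
print(energy_wh.std())   # spread across runs (the error bars in the chart)
```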
How big is that in everyday terms?
To give a sense of scale (the arithmetic is reproduced in a short snippet below):
- AnimateDiff (~0.14 Wh total) ≈ 50 seconds of a 10 W LED bulb (0.14 Wh ÷ 10 W = 0.014 h ≈ 50 s)
- Mochi-1-preview (~56 Wh) ≈ 3 minutes of microwave use (56 Wh ÷ 1,200 W ≈ 0.047 h ≈ 3 min)
- WAN2.1-T2V-14B (~109 Wh) ≈ 7–10 full smartphone charges (109 Wh ÷ 15 Wh ≈ 7.3 charges)
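The snippet below redoes these conversions; the appliance figures (10 W bulb, 1,200 W microwave, ~15 Wh phone battery) are rough assumptions:

```python
def bulb_seconds(wh, bulb_watts=10):        # seconds a 10 W LED bulb runs on this energy
    return wh / bulb_watts * 3600

def microwave_minutes(wh, mw_watts=1200):   # minutes of an assumed 1,200 W microwave
    return wh / mw_watts * 60

def phone_charges(wh, battery_wh=15):       # full charges of an assumed ~15 Wh battery
    return wh / battery_wh

print(bulb_seconds(0.14))      # ~50 s    (AnimateDiff, total energy)
print(microwave_minutes(56))   # ~2.8 min (Mochi-1-preview)
print(phone_charges(109))      # ~7.3     (WAN2.1-T2V-14B)
```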
As a reference point, in our "Thank you" energy study, we measured that a single polite reply generated with LLaMA 3-8B costs about 0.245 Wh.
So, in terms of GPU energy per clip:
- AnimateDiff ≈ 0.5× a thank-you reply
- Mochi-1-preview ≈ 190×
- WAN2.1-T2V-14B ≈ 380×
🧩 Why These Differences?
The wide gaps in energy use come from several factors deeply tied to how these models are built and how they generate videos:
Model Size (Parameters): Larger models like WAN2.1-T2V-14B or Mochi-1-preview have billions more parameters to process per denoising step. This naturally scales up compute and energy use compared to lightweight pipelines like AnimateDiff.
Number of Denoising Steps: Each extra sampling or denoising step repeats heavy matrix operations. For example, AnimateDiff uses just 4 steps for quick generation, while Mochi can run 64 steps for sharper detail and smoother motion.
Spatial Resolution & Temporal Length: Higher resolutions mean more pixels to refine at every step. Likewise, more frames or a higher FPS extend how many times the model must sample or interpolate motion, so a 30 fps, 100-frame clip costs much more than a short low-res GIF.
Architecture Differences: Under the hood, these models mix transformers, diffusion samplers, and sometimes motion modules or temporal attention. Some pipelines stack multiple stages (a base generator plus refiners), which multiplies the sampling passes and runtime; others, like AnimateDiff, favour speed by applying motion layers on top of a static image-diffusion backbone.
In practice, AnimateDiff also produces very short clips with few frames and steps, and outputs directly as a GIF, which further keeps its energy footprint low.
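To get a feel for how much of the gap comes from the generation settings alone, a rough back-of-the-envelope proxy is steps × frames × height × width, deliberately ignoring parameter count and architecture. Comparing it against the measured GPU energy (both relative to AnimateDiff) shows that settings explain part of the spread, while model size and architecture account for the rest:

```python
# (steps, frames, height, width, measured GPU Wh) taken from the tables above
configs = {
    "AnimateDiff":     (4,  16, 512, 512, 0.11),
    "CogVideoX-5b":    (50, 49, 480, 720, 21.71),
    "Mochi-1-preview": (64, 84, 480, 848, 46.77),
    "WAN2.1-T2V-14B":  (60, 81, 480, 832, 93.83),
}

base_proxy = 4 * 16 * 512 * 512   # AnimateDiff as the reference point
base_energy = 0.11

for name, (steps, frames, h, w, gpu_wh) in configs.items():
    proxy = steps * frames * h * w            # naive "work" estimate
    print(f"{name:16s}  settings x{proxy / base_proxy:6.0f}   measured GPU x{gpu_wh / base_energy:6.0f}")
```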
Still Early Days for Open Video Models
Despite the hefty compute cost today, open video generation is still in its early days, much like LLMs were a few years ago. Video generation relies on architectures and sampling algorithms that are naturally heavier in FLOPs than typical text models, since they must maintain spatial and temporal coherence at once (think "many images + motion, guided by text"). On top of that, the task itself adds extra complexity beyond pure text generation.
However, just as we've seen massive efficiency gains in language models (quantization, faster kernels, smarter sampling), similar optimizations will likely come to video:
- Better motion priors and reuse of frames
- Advanced caching of intermediate steps
- Lightweight transformers fine-tuned for temporal tasks
Conclusion
Text-to-video is moving incredibly fast in the open-source world, and this benchmark shows both the promise and the current limitations. Today's pipelines still consume significant energy to generate just a few seconds of footage, mainly due to large model sizes, multiple sampling steps, and the challenge of handling both spatial detail and temporal coherence simultaneously.
But just like early language models, we can expect huge leaps in efficiency: smarter architectures, faster samplers, and clever reuse of frames are already being explored. Knowing the real energy footprint helps us track progress, balance quality vs. cost, and push for more sustainable, accessible generative video tools.
Reproduce & Explore
Want to dig deeper or run your own tests? Check out the code, results, and all generated videos here:
- Code & Benchmark Scripts – GitHub: JulienDelavande/benchlab
- Datasets & All Generated Videos – Hugging Face Collection