Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Abstract
This technical report presents a cost-effective strategy for training a video generation foundation model. We introduce Seaweed-7B, a mid-sized research model with approximately 7 billion parameters, trained from scratch using 665,000 H100 GPU hours. Despite this moderate computational budget, Seaweed-7B is highly competitive with contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting, and this report highlights the key decisions that enhance the performance of a medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, that of larger models trained with substantially more GPU resources, and (2) the model exhibits strong generalization and can be effectively adapted to a wide range of downstream applications through lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
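The abstract's claim that the base model adapts through lightweight fine-tuning suggests a standard adapter-style recipe. As a minimal sketch of what such adaptation could look like, the PyTorch snippet below freezes a pretrained network's weights and wraps selected linear projections with LoRA-style low-rank adapters. Everything here is an illustrative assumption, not Seaweed-7B's actual API: the module names (`to_q`, `to_k`, `to_v`), the `LoRALinear`/`add_lora` helpers, and the rank/scale hyperparameters are hypothetical, since the paper's weights and code are not public.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B receive gradients."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(module: nn.Module, target_names=("to_q", "to_k", "to_v"), r: int = 16):
    """Recursively replace matching nn.Linear submodules with LoRA wrappers.
    `target_names` is a hypothetical choice of attention projections."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in target_names:
            setattr(module, name, LoRALinear(child, r=r))
        else:
            add_lora(child, target_names, r)
    return module
```

With adapters in place, only the LoRA parameters, a small fraction of the 7B total, are handed to the optimizer, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`; freezing the base weights is what keeps this kind of fine-tuning lightweight in memory and compute.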
Community
Check out the demo video:
YouTube: https://www.youtube.com/watch?v=OaPI6K2y3rI
X: https://x.com/CeyuanY/status/1911618555210334350
Looks great!
Do you plan on releasing the weights? This would be quite something for local inference on consumer GPUs, given the low inference cost.
I've only read the project page, but considering it has the in-context learning (ICL) ability to learn from reference images, I imagine it is not advisable to release the weights publicly.
great!
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k (2025)
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (2025)
- Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos (2025)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models (2025)
- CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance (2025)
- Training Video Foundation Models with NVIDIA NeMo (2025)
- AMD-Hummingbird: Towards an Efficient Text-to-Video Model (2025)