Introducing Cosmos Predict-2: A Foundation For Your Own World Model
Building smarter robots and autonomous vehicles (AVs) starts with foundation models that understand real-world dynamics. These models serve two critical roles:
- Accelerating synthetic data generation (SDG) to teach machines about real-world physics and interactions, including rare edge cases
- Serving as base models that can be post-trained for specialized tasks or adapted to different output types
Cosmos Predict-1 was built for this purpose, generating realistic, physics-aware future world states.
Now, Cosmos Predict-2 introduces major upgrades in speed, visual quality, and customization.
Introducing Cosmos Predict-2
Cosmos Predict-2 is our top-performing world foundation model for physical AI. Architectural refinements improve speed and scalability and allow flexible resolution and framerate across use cases and hardware platforms.
Two model variants are available, optimized for task complexity:
Cosmos Predict-2 2B
Fast inference and low memory usage. Ideal for prototyping, low-latency applications, and edge deployments.

Cosmos Predict-2 14B
Designed for high-fidelity world modeling, complex scene understanding, extended temporal coherence, and prompt precision.
Developers can start with the video2world model by using a reference image of a robot or AV environment to generate consistent, physically accurate world states as video. The text-to-image model can also create a preview image from a text prompt.
Resolution & Framerate Options
Cosmos Predict-2 offers flexible output formats:
Resolution
- 720p
- 480p for faster throughput
Framerate
- Available: 10 fps, 16 fps
- Coming soon: 24 fps (ideal for 10 Hz simulation and AV training pipelines)
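The example checkpoint shipped with the repo encodes the output format in its filename (`model-720p-16fps.pt`). A tiny helper along those lines can make the format choice explicit; note that the helper name and the assumption that every resolution/framerate combination follows this naming scheme are illustrative, not confirmed by the docs:

```python
# Hypothetical helper: build a checkpoint filename from the desired output
# format. The naming scheme (model-{res}p-{fps}fps.pt) is inferred from the
# single 720p/16fps example checkpoint; other combinations are an assumption.
def checkpoint_name(resolution: int, fps: int) -> str:
    supported = {(720, 16), (720, 10), (480, 16), (480, 10)}
    if (resolution, fps) not in supported:
        raise ValueError(f"unsupported format: {resolution}p @ {fps} fps")
    return f"model-{resolution}p-{fps}fps.pt"

print(checkpoint_name(720, 16))  # model-720p-16fps.pt
```

Pass the resulting name as the final path component of `dit_path` when constructing the pipeline.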
Inference & Performance
Cosmos Predict-2 is designed for fast, flexible inference across hardware setups.
- 2B variant: fast performance on NVIDIA GB200 NVL72, DGX B200, RTX PRO 6000, or RTX 6000 Ada.
- 14B variant: higher fidelity for complex, temporally coherent tasks on GB200/B200 systems.
```python
import torch
from imaginaire.utils.io import save_image_or_video
from cosmos_predict2.configs.base.config_video2world import PREDICT2_VIDEO2WORLD_PIPELINE_2B
from cosmos_predict2.pipelines.video2world import Video2WorldPipeline

# Create the video generation pipeline.
pipe = Video2WorldPipeline.from_config(
    config=PREDICT2_VIDEO2WORLD_PIPELINE_2B,
    dit_path="checkpoints/nvidia/Cosmos-Predict2-2B-Video2World/model-720p-16fps.pt",
    text_encoder_path="checkpoints/google-t5/t5-11b",
)

# Specify the input image path and text prompt.
image_path = "assets/video2world/example_input.jpg"
prompt = "A high-definition video captures the precision of robotic welding in an industrial setting. The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation."

# Run the video generation pipeline.
video = pipe(input_path=image_path, prompt=prompt)

# Save the resulting output video.
save_image_or_video(video, "output/test.mp4", fps=16)
```
Full setup instructions: nvidia-cosmos/cosmos-predict2 GitHub repo
Post-train Cosmos Predict-2 for your use case
Cosmos Predict-2 can be post-trained for custom applications such as robotics, AVs, and industrial automation. Post-training use cases include:
| Domain | Post-training Task | Example Application |
|---|---|---|
| Robotics | Instruction control, object manipulation | Pick apples with varying stem strength |
| AVs | Multiview generation, edge-case simulation | Rainy highway driving with lidar/camera sync |
| Industrial | Action-conditioned workflows | Predictive maintenance for conveyor belt robots |
| Vision | Camera pose conditioning | 3D-consistent video from single images |
Let's say you need to perform post-training for the first example from the table above.
Step 1: Prepare the Data Using Open Source Data Curator
- Collect 100+ hours of representative teleoperation data for your task
- Use Cosmos Curate to process, analyze, and organize the video content.
- You will need accurate text+video pairings for the next step.
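Step 1's output is a set of accurate text+video pairings. A minimal sketch of assembling those pairings into a JSONL manifest might look like the following; the directory layout (each `clip.mp4` next to a same-named `clip.txt` caption) and the function name are assumptions for illustration, not the Cosmos Curate format:

```python
import json
from pathlib import Path

def build_manifest(data_dir: str, out_path: str) -> int:
    """Pair each clip with its same-named caption file and write a JSONL manifest.

    Layout assumption (hypothetical): clip_0001.mp4 sits next to clip_0001.txt.
    Clips without a caption are skipped, since the next step needs accurate
    text+video pairings.
    """
    data = Path(data_dir)
    pairs = []
    for video in sorted(data.glob("*.mp4")):
        caption = video.with_suffix(".txt")
        if not caption.exists():
            continue  # no text pairing available for this clip
        pairs.append({"video": str(video), "text": caption.read_text().strip()})
    with open(out_path, "w") as f:
        for p in pairs:
            f.write(json.dumps(p) + "\n")
    return len(pairs)
```

The return value (number of usable pairs) is a quick sanity check that most of your collected footage actually carries a caption before you start post-training.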
Step 2: Post-train the Model
- Use the reference training scripts in the GitHub repo
- Fine-tune on your robot/environment using curated data.
Step 3: Generate Synthetic Scenarios
- Example Prompt:
"Pick up the bruised apple under low light"
- You can also condition generation on an input image
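Edge-case coverage usually comes from sweeping prompt attributes rather than writing prompts one at a time. A small sketch of such a sweep, where the attribute lists are purely illustrative (they are not from the Cosmos docs):

```python
from itertools import product

# Hypothetical attribute sweep for synthetic-scenario prompts; the attribute
# values below are illustrative examples, not an official taxonomy.
lighting = ["low light", "bright daylight", "backlit"]
condition = ["bruised", "unblemished"]
grip = ["firm stem", "loose stem"]

prompts = [
    f"Pick up the {c} apple with a {g} under {l}"
    for l, c, g in product(lighting, condition, grip)
]
print(len(prompts))  # 12 prompt variants, one per attribute combination
```

Each variant can then be fed to the video2world pipeline shown earlier, optionally together with a conditioning image of the scene.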
Step 4: Validate with Cosmos Reason
Cosmos Reason is a spatiotemporally aware physical AI reasoning model that critiques synthetic video data based on training data quality standards. Sample evaluation prompts include:
- Does the robot grasp the apple properly?
- Are joint angles within safe limits?
- Any motion artifacts or object collisions?
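Once a critic model has answered those questions per clip, filtering comes down to keeping only clips that pass every check. The verdict schema below is entirely hypothetical (the real Cosmos Reason output format is not shown here); the sketch only illustrates the gating step before synthetic clips enter a training set:

```python
# Hypothetical verdict format: one dict per generated clip, mapping each
# evaluation question to a boolean. The real Cosmos Reason schema may differ.
verdicts = [
    {"clip": "gen_0001.mp4", "grasp_ok": True,  "joints_safe": True,  "no_artifacts": True},
    {"clip": "gen_0002.mp4", "grasp_ok": True,  "joints_safe": False, "no_artifacts": True},
    {"clip": "gen_0003.mp4", "grasp_ok": False, "joints_safe": True,  "no_artifacts": True},
]

def passing_clips(verdicts):
    """Keep only clips that pass every check before adding them to training data."""
    checks = ("grasp_ok", "joints_safe", "no_artifacts")
    return [v["clip"] for v in verdicts if all(v[k] for k in checks)]

print(passing_clips(verdicts))  # ['gen_0001.mp4']
```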
Cosmos Predict-2 Post-training Samples:
- Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1: video- and text-based future visual world generation, post-trained on GR00T GR1 data
- Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-DROID: video- and text-based future visual world generation, post-trained on GR00T DROID data
Try Cosmos Predict-2 Today
Cosmos Predict-2 is now available on Hugging Face. Explore the GitHub repo for detailed setup instructions: nvidia-cosmos/cosmos-predict2
Learn more: NVIDIA Cosmos
Join our community for learning content, livestreams, and discussions: NVIDIA Omniverse Community