NVIDIA Cosmos Now Available On Hugging Face For Physical AI Reasoning

Community Article · Published May 19, 2025

Cosmos Reason is a world foundation model (WFM) for physical AI, built not just to see but to reason. The model understands physical common sense: given a text prompt and an input video, it can think through a chain-of-thought reasoning process and generate an answer to the prompt. It can also be used to critique synthetic video data and to build better datasets with accurate captions for training robots and autonomous vehicles.

In this article, we’ll explore how this model works, how it is built, and how you can use it.

Inside Cosmos Reason

Cosmos Reason is built with supervised fine-tuning (SFT) and reinforcement learning, bridging multimodal perception and real-world decision-making:

Physical AI SFT: Focuses on real-world reasoning. Learns object affordances (e.g., "a pan conducts heat"), action chains (multi-step plans), and spatial feasibility (e.g., "a person can't walk through walls") using curated physical interaction datasets.

Reinforcement learning for embodied decisions: The long chain-of-thought reasoning capability in Cosmos Reason enables training on a small dataset while still generalizing to held-out test scenarios. Verifiable physical AI rewards such as the "arrow of time" (judging whether a clip plays forward or in reverse) enable learning world dynamics without human annotations, as sketched below.
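To make the "arrow-of-time" idea concrete, here is an illustrative Python sketch of a rule-based verifiable reward: reverse a clip half the time, ask the policy which direction it runs, and reward correct answers. This is not NVIDIA's training code; the `policy_predict_direction` callable is a hypothetical stand-in for the model call.

```python
# Illustrative sketch only: a rule-based "arrow-of-time" reward.
# The idea: reverse a clip half the time, ask the policy whether the clip
# runs forward or backward, and reward correct answers with 1.0.
import random
from typing import Callable, List, Sequence

def arrow_of_time_reward(
    frames: Sequence,                                   # list of video frames
    policy_predict_direction: Callable[[List], str],    # returns "forward" or "reverse"
) -> float:
    """Return 1.0 if the policy identifies the true temporal direction, else 0.0."""
    reversed_clip = random.random() < 0.5               # flip the clip half the time
    clip = list(frames)[::-1] if reversed_clip else list(frames)
    ground_truth = "reverse" if reversed_clip else "forward"
    prediction = policy_predict_direction(clip)         # model's verdict on the clip
    return 1.0 if prediction == ground_truth else 0.0

# Usage with a dummy policy that always answers "forward":
if __name__ == "__main__":
    dummy_frames = list(range(16))                      # stand-in for 16 video frames
    print(f"reward = {arrow_of_time_reward(dummy_frames, lambda clip: 'forward')}")
```

Because the label (whether the clip was reversed) comes for free from the data itself, rewards like this need no human annotation.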

Testing Cosmos Reason on Common Sense

Cosmos Reason excels at understanding real-world physical situations, such as how objects and people interact in dynamic environments, using both video and text. Evaluated across benchmarks like BridgeData V2, RoboVQA, and Agibot, the model shows strong common-sense reasoning and situational awareness.

Fine-tuning on physical AI tasks boosts the base vision-language model's performance by over 10%, while reinforcement learning adds another 5% gain. On average, Cosmos Reason achieves a score of 65.7 across key benchmarks, setting a high bar for AI systems in robotics, autonomous vehicles, and embodied agents.

There’s still room for improvement: post-training on high-quality, task-specific curated data and continued reinforcement learning can further enhance performance of Cosmos Reason.

Dataset         Score
Common Sense    56.2
BridgeData V2   73.5
RoboVQA         86.8
Agibot          54.2
HoloAssist      60.0
AV              67.0
RoboFail        62.0
Avg.            65.7

Explore these benchmarks on Hugging Face: nvidia/Cosmos-Reason1-Benchmark
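If you want to browse the benchmark data programmatically before downloading it, you can list the files in the dataset repository with huggingface_hub. This is a minimal sketch; the repository layout is not described here, so check the dataset card for the actual structure.

```python
# Minimal sketch: list what the benchmark repository contains before downloading.
# Assumes nvidia/Cosmos-Reason1-Benchmark is hosted as a dataset repo on the Hub.
from huggingface_hub import list_repo_files

files = list_repo_files("nvidia/Cosmos-Reason1-Benchmark", repo_type="dataset")
for path in files[:20]:          # print the first 20 entries as a quick overview
    print(path)
```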

How to use Cosmos Reason

You can download the model checkpoint from Hugging Face and use the inference and post-training scripts available on GitHub.
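For example, you can pull a checkpoint to a local folder with huggingface_hub before pointing the GitHub scripts at it. This is a minimal sketch; the repository id below is an assumption, so check the Cosmos collection on the Hub for the exact checkpoint you need.

```python
# Minimal sketch: download a Cosmos Reason checkpoint to a local folder.
# The repo id below is an assumption; substitute the checkpoint you actually want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nvidia/Cosmos-Reason1-7B",      # assumed checkpoint name
    local_dir="checkpoints/cosmos-reason1",  # where the weights will be stored
)
print(f"Checkpoint downloaded to: {local_dir}")
```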

Input

Input Type(s): Text+Video/Image

Input Format(s):

Text: String
Video: mp4
Image: jpg

Input Parameters:

Text: One-dimensional (1D)
Video: Three-dimensional (3D)
Image: Two-dimensional (2D)

Other Properties Related to Input:

Use FPS=4 for input video to match the training setup. Append "Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>." to the system prompt to encourage a long chain-of-thought reasoning response.

Output

Output Type(s): Text

Output Format: String

Output Parameters:

Text: One-dimensional (1D)

Other Properties Related to Output:

We recommend using 4096 or more max output tokens to avoid truncation of the long chain-of-thought response.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
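Putting the input and output recommendations together, here is a minimal inference sketch assuming the checkpoint loads through the Qwen2.5-VL classes in transformers, the architecture Cosmos Reason builds on, with qwen_vl_utils installed. The repository id, video path, and question are placeholders; the officially supported path is the GitHub inference scripts mentioned above.

```python
# Minimal inference sketch (assumptions: Qwen2.5-VL-compatible checkpoint,
# transformers with Qwen2.5-VL support, and qwen-vl-utils installed).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "nvidia/Cosmos-Reason1-7B"   # assumed checkpoint name

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# System prompt recommended above: ask for <think>/<answer> formatted output.
system_prompt = (
    "Answer the question in the following format: "
    "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
)

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "sample_video.mp4", "fps": 4},  # FPS=4 per the input spec
            {"type": "text", "text": "Is the robot gripping the mug safely?"},
        ],
    },
]

# Build the chat-formatted prompt and extract the image/video inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# 4096+ new tokens so the long chain-of-thought response is not truncated.
output_ids = model.generate(**inputs, max_new_tokens=4096)
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Note that the sketch applies all three recommendations from the spec above: FPS=4 video sampling, the <think>/<answer> system prompt, and at least 4096 output tokens.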

All Cosmos World Foundation Models Available On Hugging Face

Cosmos Predict 1 - Multimodal world foundation model for generating next frames based on input prompts. Explore the GitHub repository for inference and post-training scripts.

Cosmos Transfer 1 - Multicontrol world foundation model for data augmentation from structured video inputs. Explore the GitHub repository for inference and post-training scripts.

Join our community for regular updates, Q&A, livestreams and hands-on tutorials!
