import gradio as gr
TITLE = """<h1 align="center" id="space-title">Leaderboard: Physical Reasoning from Video</h1>"""
INTRODUCTION_TEXT = """
This leaderboard tracks the progress of frontier models on 3 physical reasoning benchmark datasets released by Meta FAIR -- **Minimal Video Pairs (MVPBench)**, **IntPhys 2**, and **CausalVQA**. In addition to tracking the progress of the community through public submissions, we also present human scores for each benchmark to understand the gap between leading models and human performance on key physical and video reasoning tasks.
- **[MVPBench](https://github.com/facebookresearch/minimal_video_pairs):** A Video Question Answering (VQA) benchmark for spatio-temporal and intuitive physics video understanding. Videos were sourced from diverse datasets and automatically paired together, such that videos in each pair differ in only minimal ways, but have opposing correct answers for the same question. This design ensures that models need to go beyond relying on surface visual or textual biases in order to perform well on the benchmark.
- **[IntPhys 2](https://github.com/facebookresearch/IntPhys2):** A video benchmark designed to evaluate the intuitive physics understanding of deep learning models. IntPhys 2 focuses on four core principles: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity, and offers a comprehensive suite of tests based on the violation of expectation framework, which challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments.
- **[CausalVQA](https://github.com/facebookresearch/CausalVQA):** A Video Question Answering (VQA) benchmark composed of question-answer pairs that probe models’ understanding of causality in the physical world. Questions were designed to be grounded in real-world scenarios, while focusing on models’ ability to predict the likely outcomes of different actions and events through five question types – counterfactual, hypothetical, anticipation, planning and descriptive.

Please see the linked GitHub repos for instructions on downloading and running each benchmark, and the instructions below for how to submit results to the leaderboard.
"""
SUBMISSION_TEXT = """
## Submissions
Scores are calculated according to the metrics defined for each dataset in the leaderboard. Please see each dataset's linked repo below for more details on metric calculation.
- **[MVPBench](https://github.com/facebookresearch/minimal_video_pairs):** The leaderboard computes and tracks performance on the _"mini"_ split of MVPBench, which contains ~5k items.
- **[IntPhys 2](https://github.com/facebookresearch/IntPhys2):** The leaderboard computes and tracks performance on the _"held out"_ split of IntPhys 2, a set of 344 simulated videos with moving cameras, realistic objects, and complex backgrounds.
- **[CausalVQA](https://github.com/facebookresearch/CausalVQA):** The leaderboard computes and tracks performance on the _"held out"_ split of CausalVQA, which contains 793 pairs of curated questions.

Note that for IntPhys 2 and CausalVQA, submitting to this leaderboard is the only way to compute accuracy on the _held-out_ splits, since the correct answers for those splits are kept private and not publicly available.
## Format
Each question calls for an answer that is either a string (one or a few words), a number, or a comma-separated list of strings or floats, unless specified otherwise. There is only one correct answer per question.
We expect submissions to be JSON Lines (`.jsonl`) files with the following format:
```
{"data_name": "mvp", "task": "some_task", "row_id": 0, "model_answer": "Answer 1 from your model"}
{"data_name": "mvp", "task": "some_task", "row_id" : 1, "model_answer": "Answer 2 from your model"}
```
Here, `data_name` can be one of three valid options: `mvp_mini`, `intphys2`, `causalvqa`. You can merge your results from multiple datasets into a single submission.
If the dataset contains multiple tasks (as in the case of MVPBench and CausalVQA), the `task` field identifies the subtask. For datasets without subtasks, this field can be left blank; however, it must still be present in every JSON line.
For each submission we receive, we will check that all rows from the dataset are valid.
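
For reference, the following sketch shows one way to assemble a submission file; the `predictions` list and the answer strings are placeholders for your own model outputs:

```python
import json

# Placeholder model outputs: one dict per benchmark item you evaluated.
predictions = [
    {"data_name": "mvp_mini", "task": "some_task", "row_id": 0, "model_answer": "Answer 1 from your model"},
    {"data_name": "intphys2", "task": "", "row_id": 0, "model_answer": "Answer 2 from your model"},
]

REQUIRED_FIELDS = ("data_name", "task", "row_id", "model_answer")
VALID_DATA_NAMES = {"mvp_mini", "intphys2", "causalvqa"}

with open("submission.jsonl", "w") as f:
    for row in predictions:
        # Every line must carry all four fields, even when `task` is left blank.
        assert all(field in row for field in REQUIRED_FIELDS), f"missing field in {row}"
        assert row["data_name"] in VALID_DATA_NAMES, f"invalid data_name in {row}"
        f.write(json.dumps(row) + "\n")
```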
"""
CITATION_BUTTON_LABEL = "Please cite the following papers if you use these benchmarks"
CITATION_BUTTON_TEXT = r"""
```latex
@misc{mvpbench,
  title={A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs},
  author={Benno Krojer and Mojtaba Komeili and Candace Ross and Quentin Garrido and Koustuv Sinha and Nicolas Ballas and Mahmoud Assran},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{causalvqa,
  title={CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models},
  author={Aaron Foss and Chloe Evans and Sasha Mitts and Koustuv Sinha and Ammar Rizvi and Justine T Kao},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{intphys2,
  title={IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments},
  author={Florian Bordes and Quentin Garrido and Justine T Kao and Adina Williams and Michael Rabbat and Emmanuel Dupoux},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
"""
def format_error(msg):
    # Raise a Gradio error: aborts the current event and shows the message to the user.
    raise gr.Error(msg)

def format_warning(msg):
    # Show a warning toast in the Gradio UI.
    return gr.Warning(msg)

def format_log(msg):
    # Show an informational toast in the Gradio UI.
    return gr.Info(msg)

def model_hyperlink(link, model_name):
    # Render a model name as a dotted-underline hyperlink for the leaderboard table.
    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline; text-decoration-style: dotted;">{model_name}</a>'
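

# A minimal, illustrative preview (assumption: this module is normally imported by the
# leaderboard app rather than run directly). Running the module directly renders the
# text constants and a sample model link in a bare-bones Gradio page; the sample URL
# and model name below are hypothetical.
if __name__ == "__main__":
    with gr.Blocks() as demo:
        gr.HTML(TITLE)
        gr.Markdown(INTRODUCTION_TEXT)
        gr.Markdown(SUBMISSION_TEXT)
        gr.HTML(model_hyperlink("https://huggingface.co/models", "Example model"))
    demo.launch()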