TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Abstract
TRAVL is a fine-tuning recipe that pairs a balanced training set with a trajectory-aware attention module to make Video-Language Models better judges of physical plausibility, evaluated on the ImplausiBench benchmark.
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported with both gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
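The abstract does not specify the internal design of TRAVL's trajectory-aware attention module, so the sketch below is only an illustrative guess at what such a block could look like: it assumes per-frame visual tokens from the VLM's video encoder plus tracked point trajectories, and fuses them with cross-attention. All names and dimensions (`TrajectoryAwareAttention`, `traj_feat_dim`, etc.) are hypothetical, not taken from the paper.

```python
# Minimal, illustrative sketch of a trajectory-aware attention block.
# This is an assumption about the design, not TRAVL's actual module.
import torch
import torch.nn as nn


class TrajectoryAwareAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, traj_feat_dim: int = 4):
        super().__init__()
        # Project raw per-step trajectory features (e.g. x, y, dx, dy) into the
        # same embedding space as the visual tokens. traj_feat_dim is a
        # hypothetical choice for illustration.
        self.traj_proj = nn.Linear(traj_feat_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor, trajectories: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, N_tokens, dim) -- visual tokens from the video encoder
        # trajectories: (B, N_points, T, traj_feat_dim) -- tracked points over T frames
        b, p, t, f = trajectories.shape
        traj_tokens = self.traj_proj(trajectories.reshape(b, p * t, f))
        # Every visual token attends to the motion tokens, then fuses residually.
        attended, _ = self.cross_attn(frame_tokens, traj_tokens, traj_tokens)
        return self.norm(frame_tokens + attended)


if __name__ == "__main__":
    block = TrajectoryAwareAttention()
    vis = torch.randn(2, 256, 768)      # 2 clips, 256 visual tokens each
    trajs = torch.randn(2, 32, 16, 4)   # 32 tracked points over 16 frames
    print(block(vis, trajs).shape)      # torch.Size([2, 256, 768])
```

In this sketch, visual tokens query motion tokens derived from point tracks, which is one plausible way to bias a VLM's attention toward object trajectories; the module described in the paper may differ in how trajectories are extracted and injected.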
Community
Automated message from Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Bridging Vision Language Models and Symbolic Grounding for Video Question Answering (2025)
- Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility (2025)
- When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding (2025)
- VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL (2025)
- ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video (2025)
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs (2025)
- Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data (2025)