NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
Abstract
NewtonBench is a benchmark for scientific law discovery that addresses scalability, scientific relevance, and memorization resistance by using metaphysical shifts and interactive model discovery.
Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts, systematic alterations of canonical laws, to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
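To make the idea of a metaphysical shift concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the benchmark's actual implementation: the choice of gravitation as the canonical law, the shifted exponent, and the `observe` oracle interface are all hypothetical.

```python
# A hypothetical "metaphysical shift": the canonical inverse-square law of
# gravitation is systematically altered so the distance exponent becomes 2.5.
# The law, constants, and oracle interface are illustrative assumptions only.
import random

def shifted_gravity(m1: float, m2: float, r: float,
                    G: float = 6.674e-11, alpha: float = 2.5) -> float:
    """Counterfactual gravitational law with a shifted distance exponent."""
    return G * m1 * m2 / r ** alpha

def observe(m1: float, m2: float, r: float, noise: float = 0.0) -> float:
    """Oracle an agent can query: returns the force under the hidden law,
    optionally corrupted by multiplicative Gaussian observational noise."""
    f = shifted_gravity(m1, m2, r)
    return f * (1.0 + random.gauss(0.0, noise))

# The agent never sees shifted_gravity directly; it must design experiments
# (choices of m1, m2, r) against observe() to recover the hidden law.
print(observe(5.0, 3.0, 2.0))              # noiseless observation
print(observe(5.0, 3.0, 2.0, noise=0.05))  # 5% observational noise
```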
Community
Code & Data: https://github.com/HKUST-KnowComp/NewtonBench
NewtonBench constructs counterfactual physical-law discovery tasks that require LLM agents to actively experiment with a simulated system, exploring its parameters to uncover hidden physical laws. This setup is more challenging and novel than fitting pre-given data, as in prior scientific discovery benchmarks. NewtonBench also integrates code agents, letting LLM agents use a code interface to fit models and explore the system, as in the sketch below.
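As a hedged sketch of that fit-and-explore loop, the snippet below sweeps one experimental parameter and recovers a hidden power-law exponent by linear regression in log-log space. The `observe` oracle and the power-law hypothesis are assumptions made for illustration, not NewtonBench's actual code interface.

```python
import numpy as np

# Hidden law the agent must recover (hypothetical, for illustration only;
# masses are held fixed at 1 kg so the force depends on distance alone).
def observe(r: float) -> float:
    return 6.674e-11 / r ** 2.5

# Explore: actively choose experiments, here a sweep over the distance r.
rs = np.linspace(1.0, 10.0, 50)
forces = np.array([observe(float(r)) for r in rs])

# Fit: under a power-law hypothesis F = C * r^(-alpha), a linear fit in
# log-log space recovers the hidden exponent from the slope.
slope, intercept = np.polyfit(np.log(rs), np.log(forces), 1)
print(f"estimated exponent: {-slope:.3f}")              # ~2.500
print(f"estimated constant: {np.exp(intercept):.3e}")   # ~6.674e-11
```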
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery (2025)
- Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery (2025)
- ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution (2025)
- The Need for Verification in AI-Driven Scientific Discovery (2025)
- Virtuous Machines: Towards Artificial General Science (2025)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers (2025)
- DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively (2025)