RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Abstract
RE-IMAGINE evaluates the reasoning abilities of Large Language Models by generating problem variations that cannot be solved by memorization, revealing how much of their reported performance relies on statistical recall.
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
Community
Most benchmarks call a task "hard" only after large language models do poorly on it. This is backwards: the result ends up defining the difficulty.
We do the opposite. We start from Judea Pearl's Ladder of Causation: association (Level-1), intervention (Level-2), and counterfactual (Level-3). This idea from the causality literature describes increasingly demanding levels of reasoning, and we treat those same levels as a clear target for LLMs.
With this hierarchy in mind, we built RE-IMAGINE, a system that rewrites existing benchmark problems at each level. The new versions cannot be solved by memorizing patterns, so they give a fairer test of real reasoning across math, code, and logic. A minimal sketch of the idea is shown below.
Our results show that a causality-based view of reasoning gives a more reliable way to measure reasoning ability.
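To make this concrete, here is a minimal sketch (not the authors' pipeline) of what a Level-2 intervention could look like for a GSM8K-style math problem: the problem is lifted into a small symbolic program, one quantity is overwritten, and the new ground-truth answer is recomputed by executing the program, so a model that memorized the original answer gets no credit. The template, variable names, and value ranges are hypothetical choices for illustration.

```python
# Minimal sketch of a Level-2 "intervention" mutation (hypothetical example,
# not the RE-IMAGINE codebase): a word problem is represented symbolically,
# one leaf quantity is altered, and the new answer is recomputed by execution.

import random

# Hypothetical symbolic form of a GSM8K-style problem.
template = ("Ali has {boxes} boxes with {per_box} apples each. "
            "He eats {eaten}. How many apples remain?")

def solve(values):
    """Execute the symbolic program to obtain the ground-truth answer."""
    return values["boxes"] * values["per_box"] - values["eaten"]

def intervene(values, rng):
    """Level-2 intervention: overwrite one leaf quantity with a fresh value."""
    mutated = dict(values)
    var = rng.choice(list(mutated))
    mutated[var] = rng.randint(2, 9)
    return mutated

rng = random.Random(0)
original = {"boxes": 3, "per_box": 4, "eaten": 2}
variant = intervene(original, rng)

print(template.format(**variant), "->", solve(variant))
```

Because the mutation and re-solving are fully automated, arbitrarily many unseen variants can be generated from a single benchmark item.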
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study (2025)
- Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective (2025)
- If Pigs Could Fly... Can LLMs Logically Reason Through Counterfactuals? (2025)
- Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation (2025)
- Scaling Reasoning can Improve Factuality in Large Language Models (2025)
- Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation (2025)
- CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models (2025)