Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles, or do they merely memorize patterns? Rather than designing increasingly complex benchmarks as in recent work, we investigate this using elementary two-integer addition (0 to 2^64), probing two core properties: commutativity (A + B = B + A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). While state-of-the-art LLMs achieve 73.8–99.8% accuracy on numerical addition, performance collapses to ≤7.5% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases where A + B ≠ B + A) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
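For concreteness, here is a minimal sketch of how the two probes described in the abstract might be constructed. The abstract does not specify the evaluation harness, so the prompt wording, the digit-to-symbol alphabet, and the helper names (e.g., `make_probes`) are illustrative assumptions, not the authors' actual setup.

```python
import random

# Sketch (not the authors' harness) of the two probes described above:
# (1) commutativity: compare model answers for A+B and B+A;
# (2) symbolic mapping: rewrite both operands under a fixed digit-to-symbol
#     bijection and ask for the sum in the same symbol alphabet.
# Any fixed bijection works; the paper's example maps 7 -> y.
SYMBOLS = "qwertyuiop"  # assumed alphabet: digit d maps to SYMBOLS[d]

def to_symbols(n: int) -> str:
    """Encode a base-10 integer using the symbolic digit alphabet."""
    return "".join(SYMBOLS[int(d)] for d in str(n))

def make_probes(num_cases: int = 100, max_value: int = 2**64):
    """Generate paired numeric/symbolic addition prompts plus expected answers."""
    probes = []
    for _ in range(num_cases):
        a, b = random.randrange(max_value), random.randrange(max_value)
        probes.append({
            "numeric_ab": f"What is {a} + {b}?",
            "numeric_ba": f"What is {b} + {a}?",  # commutativity check
            "symbolic": (f"Digits are renamed as 0->q, 1->w, ..., 9->p. "
                         f"What is {to_symbols(a)} + {to_symbols(b)} "
                         f"in the same notation?"),
            "expected_numeric": str(a + b),
            "expected_symbolic": to_symbols(a + b),
        })
    return probes

if __name__ == "__main__":
    for p in make_probes(num_cases=2):
        print(p["numeric_ab"], "| expected:", p["expected_numeric"])
        print(p["symbolic"], "| expected:", p["expected_symbolic"])
```

Asking for A+B and B+A as separate prompts lets one count commutativity violations directly, and the symbolic variant keeps the underlying arithmetic identical, so any accuracy drop can be attributed to the loss of familiar surface forms rather than to harder math.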
Community
It is interesting that LLMs that score near-perfectly on numerical addition fail catastrophically when digits are replaced with symbols, revealing that these "PhD-level" models don't actually understand the mathematical principles of elementary addition but merely recognize familiar patterns. Even more surprising, explicitly providing the addition rules makes their performance dramatically worse.
I recently watched a Veritasium video (https://m.youtube.com/watch?v=0xS68sl2D70), and while it isn't scientifically rigorous, he seems to imply that humans don't have this generalisation ability to begin with, just advanced pattern matching. The example given was that grandmaster chess players could memorise a board from a real game very well, but when the pieces were scrambled into a random arrangement, they could no longer do so and performed no better than non-chess players.
I'm starting to wonder whether pattern matching is as good as it gets. Perhaps we are trying to train the models to do something that we ourselves don't have the ability to do.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks (2025)
- Towards Reasoning Ability of Small Language Models (2025)
- Benchmarking Reasoning Robustness in Large Language Models (2025)
- Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models (2025)
- Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them (2025)
- Do Reasoning Models Show Better Verbalized Calibration? (2025)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)