arxiv:2504.05262

Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Published on Apr 7
· Submitted by DannyLan on Apr 14

Abstract

Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent works do, we investigate this using elementary two-integer addition (0 to 2^64), probing two core properties: commutativity (A+B = B+A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). While state-of-the-art LLMs achieve 73.8-99.8% accuracy on numerical addition, performance collapses to ≤7.5% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of A+B ≠ B+A) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on pattern memorization rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
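
For concreteness, here is a minimal Python sketch of how the two probes described in the abstract could be constructed; the symbol alphabet, pair counts, and output format are illustrative assumptions, not the paper's exact protocol or released code.

# A minimal sketch (not the paper's code) of the two probes from the abstract:
# commutativity (A+B vs. B+A) and compositional generalization via an
# isomorphic digit-to-symbol mapping.
import random

SYMBOLS = "xkqwzpuyrt"  # assumed bijection digit -> symbol; index 7 is 'y', so 7 -> y

def to_symbolic(n: int) -> str:
    # Rewrite a base-10 integer in the symbolic alphabet, digit by digit,
    # preserving its structure (an isomorphic mapping).
    return "".join(SYMBOLS[int(d)] for d in str(n))

def make_probes(num_pairs: int = 5, max_val: int = 2**64):
    # Yield matched numeric, commuted, and symbolic forms of the same addition.
    rng = random.Random(0)
    for _ in range(num_pairs):
        a, b = rng.randrange(max_val), rng.randrange(max_val)
        yield {
            "numeric": f"{a}+{b}=",                             # baseline form
            "commuted": f"{b}+{a}=",                            # should give the same answer
            "symbolic": f"{to_symbolic(a)}+{to_symbolic(b)}=",  # remapped form
            "answer": a + b,
        }

for probe in make_probes(2):
    print(probe)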

Community

Paper submitter

It is interesting that LLMs scoring near-perfect on numerical addition catastrophically fail when digits are replaced with symbols, revealing that these "PhD-level" models don't actually understand the mathematical principles of elementary addition but merely recognize familiar patterns. Even more surprisingly, explicitly providing addition rules makes their performance dramatically worse.
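
As a rough illustration of the two prompting conditions contrasted above, the templates below are hypothetical paraphrases in Python string form; they are not the exact prompts used in the paper.

# Hypothetical paraphrases of the two prompting conditions; the exact wording
# used in the paper is not reproduced here.
RULE_PROMPT = (
    "Add the two numbers using these rules: starting from the rightmost "
    "column, add the digits, write down the last digit of the result, and "
    "carry 1 to the next column whenever the column sum exceeds 9.\n"
    "Question: {a}+{b}=\nAnswer:"
)

SELF_EXPLAIN_PROMPT = (
    "First explain, in your own words, how you add two numbers. "
    "Then compute: {a}+{b}=\nAnswer:"
)

print(RULE_PROMPT.format(a=358, b=767))
print(SELF_EXPLAIN_PROMPT.format(a=358, b=767))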


I recently watched a Veritasium video (https://m.youtube.com/watch?v=0xS68sl2D70) and, whilst not scientifically rigorous, he seems to imply that humans don't have this generalisation ability to begin with, just advanced pattern matching. The example given was that grandmaster chess players could memorise a board from a real game very well, but when the pieces were scrambled into a random configuration, the grandmasters were unable to memorise the board and performed no better than non-chess players.

I'm starting to wonder whether pattern matching is as good as it gets. Perhaps we are trying to train the models to do something that we ourselves don't have the ability to do.


