Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
Abstract
This study investigates reproducibility issues in Large Language Model (LLM) inference arising from hardware and precision variations, and proposes a lightweight inference pipeline that improves numerical stability while preserving memory efficiency.
Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing the system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and up to 9,000 tokens of difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
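To make the root cause concrete: floating-point addition is not associative, so the order in which values are reduced (which shifts with batch size, GPU count, and kernel choice) changes the rounded result at low precision. The short PyTorch snippet below is an illustrative sketch, not code from the paper's repository; the values are chosen only to expose the rounding behavior.

```python
import torch

# bfloat16 keeps only 8 mantissa bits, so every addition rounds aggressively.
a = torch.tensor(1.0,   dtype=torch.bfloat16)
b = torch.tensor(0.004, dtype=torch.bfloat16)

left  = (a + b) + b   # accumulate the small terms one at a time
right = a + (b + b)   # combine the small terms first, then add

print(left.item(), right.item(), bool(left == right))
# e.g. 1.015625 1.0078125 False  -- the same mathematical sum, two answers.

# In FP32 the two orderings agree for these values, which is the intuition
# behind computing in higher precision to restore reproducibility.
a32, b32 = a.float(), b.float()
print(bool(((a32 + b32) + b32) == (a32 + (b32 + b32))))  # True for this example
```

With greedy decoding, a single flipped logit comparison caused by such reordering can select a different token, and in long chains of thought that divergence compounds.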
Community
This new study reveals that seemingly innocuous factors, such as evaluation batch size, GPU count, or precision format, can cause large language models' outputs to diverge dramatically, undermining reproducibility in reasoning tasks. The paper systematically traces this brittleness to the non-associative nature of floating-point arithmetic, showing swings of up to 9% in accuracy and differences of thousands of tokens in output length under bfloat16 greedy decoding across GPU setups. To reconcile memory efficiency with reliable inference, the authors introduce LayerCast, a lightweight pipeline that stores weights in 16-bit precision but computes in FP32, delivering stable, reproducible reasoning without prohibitive overhead.
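Below is a minimal sketch of the LayerCast idea as described in the abstract: weights stay in 16-bit storage and are upcast to FP32 just before each linear computation. The class and helper names are hypothetical and the code is a simplification, not the authors' implementation; see the linked repository for the actual pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerCastLinear(nn.Module):
    """Hypothetical sketch: store weights in bfloat16, compute in float32."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Keep the parameters in 16-bit to preserve the low memory footprint.
        self.weight = nn.Parameter(linear.weight.to(torch.bfloat16), requires_grad=False)
        self.bias = (nn.Parameter(linear.bias.to(torch.bfloat16), requires_grad=False)
                     if linear.bias is not None else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast just in time so the matmul and its reductions run in FP32,
        # making the result insensitive to reduction order. (Simplified: a real
        # pipeline would also manage the dtype of the hidden states.)
        w = self.weight.float()
        b = self.bias.float() if self.bias is not None else None
        return F.linear(x.float(), w, b)

def apply_layercast(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with the cast-on-the-fly variant."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LayerCastLinear(child))
        else:
            apply_layercast(child)
    return model
```

Because the upcast happens layer by layer, just before each matrix multiplication, peak memory stays close to the 16-bit weight footprint while the order-sensitive accumulations run in FP32.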
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning (2025)
- TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression (2025)
- LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning (2025)
- Draft-based Approximate Inference for LLMs (2025)
- rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset (2025)
- Energy Considerations of Large Language Model Inference and Efficiency Optimizations (2025)
- InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models (2025)