Abstract
Cautious Weight Decay (CWD) enhances optimizer performance by applying weight decay selectively, improving accuracy and loss in large-scale models without additional tuning.
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
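The masking rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the optimizer step has the form `x - lr * u`, so that `u` is the direction being subtracted, and it assumes "signs align" means `x * u > 0` per coordinate. How exact zeros are treated is also an assumption. The function names `cwd_step` and `decoupled_step` are hypothetical, and plain Python lists stand in for tensors.

```python
def decoupled_step(params, update, lr, weight_decay):
    """Standard decoupled decay: every coordinate is shrunk toward zero."""
    return [x - lr * (u + weight_decay * x) for x, u in zip(params, update)]


def cwd_step(params, update, lr, weight_decay):
    """Sign-masked decay: shrink only coordinates where x and u share a sign.

    Assumes the step is x - lr * u, so a shared sign means the optimizer
    update is already moving that coordinate toward zero.
    """
    new_params = []
    for x, u in zip(params, update):
        mask = 1.0 if x * u > 0 else 0.0  # assumed convention at zero: no decay
        new_params.append(x - lr * (u + weight_decay * mask * x))
    return new_params


params = [1.0, -2.0, 3.0]
update = [0.5, 0.5, -1.0]  # only the first coordinate shares a sign with params
print(cwd_step(params, update, lr=0.1, weight_decay=0.1))
print(decoupled_step(params, update, lr=0.1, weight_decay=0.1))
```

In this sketch the only difference from decoupled decay is the per-coordinate mask, which is consistent with the abstract's description of a one-line change that adds no hyperparameters. The update `u` would come from whichever base optimizer is in use.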
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates (2025)
- ANO : Faster is Better in Noisy Landscape (2025)
- REG: A Regularization Optimizer for Robust Training Dynamics (2025)
- Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization (2025)
- Conda: Column-Normalized Adam for Training Large Language Models Faster (2025)
- Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control (2025)
- Muon: Training and Trade-offs with Latent Attention and MoE (2025)
Another question: I’m using the Conda optimizer (just to mention it). Do you have any idea how CWD could be added to it? In my case, when I included CWD in Conda, the results turned out worse than with normal weight decay.