Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks Paper • 2505.11881 • Published May 17 • 4
Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks Paper • 2505.11881 • Published May 17 • 4 • 2
Running 2.8k 2.8k The Ultra-Scale Playbook 🌌 The ultimate guide to training LLM on large GPU Clusters
Forgetting Transformer: Softmax Attention with a Forget Gate Paper • 2503.02130 • Published Mar 3 • 32