Congratulations to the team on the release of SmolLM3! This is a really impressive piece of work, and the detailed ablations on GQA, NoPE, and the other architectural tweaks are super valuable for the community.
Your focus on pushing the boundaries of long-context performance is fascinating. It reminded me of a recent paper that tackles the same challenge from a completely different architectural angle.
I was just reading about ATLAS from Google Research (arXiv:2505.23735), which proposes replacing the standard attention mechanism with a modern recurrent "long-term memory module".
The core idea is to overcome the limitations of both Transformers and older RNNs by building a memory that explicitly "learns to memorize the context" rather than just individual tokens. They introduce what they call the Omega Rule, which updates the memory based on a sliding window of past tokens instead of the "online" token-by-token update that can contribute to issues like the "lost in the middle" problem.
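For anyone who hasn't read the paper yet, here's a toy sketch of the distinction as I understand it. To be clear, this is my own simplification, not the paper's formulation: the memory is just a linear associative map, and the window size, learning rate, and reconstruction loss are placeholders I picked for illustration. The only point is the difference between updating on the current token alone versus on a sliding window of recent tokens.

```python
# Toy illustration (NOT the ATLAS/Omega Rule math): a linear associative memory M
# that maps key vectors to value vectors, updated either per token ("online")
# or over a sliding window of recent tokens. All hyperparameters are made up.
import numpy as np

rng = np.random.default_rng(0)
d, T, W, lr = 16, 64, 8, 0.1          # feature dim, sequence length, window size, step size

keys = rng.normal(size=(T, d))
vals = rng.normal(size=(T, d))

def online_update(M, k, v, lr):
    """One gradient step on the reconstruction loss for the current token only."""
    err = M @ k - v                    # residual for this single (key, value) pair
    return M - lr * np.outer(err, k)   # rank-1 correction

def window_update(M, K, V, lr):
    """One gradient step on the averaged loss over a window of recent tokens
    (the rough spirit of 'memorize the context, not one token')."""
    err = M @ K.T - V.T                # residuals for every token in the window
    return M - lr * err @ K / len(K)   # gradient averaged over the window

M_online = np.zeros((d, d))
M_window = np.zeros((d, d))
for t in range(T):
    M_online = online_update(M_online, keys[t], vals[t], lr)
    lo = max(0, t - W + 1)
    M_window = window_update(M_window, keys[lo:t + 1], vals[lo:t + 1], lr)

# Recall error over the whole sequence: lower means the memory retained more of the context.
for name, M in [("online", M_online), ("window", M_window)]:
    mse = np.mean((keys @ M.T - vals) ** 2)
    print(f"{name:6s} recall MSE: {mse:.3f}")
```

The real method is considerably more involved (deeper memory modules, different objectives, parallelizable training), so treat this only as intuition for why a windowed update sees more context per step than an online one.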
The results they show are compelling, especially ATLAS's ability to scale effectively to a 10M-token context length on the BABILong benchmark.
It's exciting to see two powerful—and very different—approaches for scaling context length. One path is perfecting the Transformer architecture (like SmolLM3), and the other is designing new memory-centric recurrent models (like ATLAS).
I'm curious to hear your thoughts on these alternative architectures, and whether you envision a future where hybrid models combine the best of both worlds!
Here's the paper for anyone interested: https://arxiv.org/abs/2505.23735