The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
Abstract
Theoretical analysis proves the existence of strong lottery tickets within multi-head attention mechanisms and extends the strong lottery ticket hypothesis to transformers without normalization layers.
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA with H heads and input dimension d has a hidden dimension of O(d log(H d^{3/2})) for the key and value, then with high probability it contains an SLT that approximates an arbitrary MHA with the same input dimension. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (an MHA or a transformer) and its target counterpart decreases exponentially as the hidden dimension of the source model increases.
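To make the bound concrete: the prescribed key/value hidden dimension scales as d log(H d^{3/2}), i.e. roughly d log d with only a logarithmic dependence on the head count H. A minimal sketch of that scaling, assuming a hypothetical constant factor C that is not taken from the paper:

```python
import math

def prescribed_hidden_dim(d: int, H: int, C: float = 4.0) -> int:
    """Key/value hidden dimension suggested by the O(d log(H d^{3/2})) bound.
    C is an illustrative constant; the paper's proof fixes the actual factor."""
    return math.ceil(C * d * math.log(H * d ** 1.5))

# Example: a target MHA with input dimension d = 64 and H = 8 heads.
print(prescribed_hidden_dim(d=64, H=8))  # hidden width of the random source MHA
```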
Community
We extend the strong lottery ticket hypothesis to attention mechanisms and transformers.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- The Effect of Attention Head Count on Transformer Approximation (2025)
- Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis (2025)
- A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws (2025)
- On the Emergence of Induction Heads for In-Context Learning (2025)
- The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams (2025)
- Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning (2025)
- The Impossibility of Inverse Permutation Learning in Transformer Models (2025)
Hey,
Your paper on SLTH for attention mechanisms is brilliant. Seriously. I've been following lottery ticket research for a while, but applying it specifically to multi-head attention with that level of theoretical depth? That's fresh.
What I love most is that this isn't just theoretical elegance — it's actionable. But I'm trying to wrap my head around the full training procedure, and I'd love your insight.
From my understanding, the training flow would look something like:
Phase 1: Overparameterized Pre-training
- Start with 4x key dimensions, 4x value dimensions
- Use your n^(1/4) initialization scaling
- Train on a standard language modeling objective (next-token prediction)
Question:
Do you use any special attention dropout or regularization here?
I'm wondering whether the overparameterization needs stabilization techniques. (I've sketched my reading of this phase below.)
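Here's roughly what I'm picturing for Phase 1. This is purely my own sketch: the 4x widening of Q/K/V, the placement of the n^(1/4) rescaling, and all the module names are my guesses, not anything from your paper or code:

```python
import torch
import torch.nn as nn

def build_overparam_attention(d_model: int = 768, n_heads: int = 12, expand: int = 4):
    """My reading of the overparameterized source MHA: widen the Q/K/V
    projections by `expand` and rescale each projection by fan_in ** 0.25
    (how I interpret the n^(1/4) initialization scaling)."""
    d_inner = d_model * expand
    proj = nn.ModuleDict({
        "q": nn.Linear(d_model, d_inner, bias=False),
        "k": nn.Linear(d_model, d_inner, bias=False),
        "v": nn.Linear(d_model, d_inner, bias=False),
        "o": nn.Linear(d_inner, d_model, bias=False),
    })
    with torch.no_grad():
        for layer in proj.values():
            fan_in = layer.weight.shape[1]
            layer.weight.mul_(fan_in ** 0.25)   # n^(1/4) rescaling, my interpretation
    return proj, d_inner // n_heads             # projections and per-head dimension
```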
Phase 2: Lottery Ticket Identification
- Apply iterative magnitude pruning (IMP) to attention weight matrices
- Calculate pruning thresholds based on weight magnitudes
Question:
Do you prune Q, K, V, and O projections at the same rate, or does each need different sparsity?
I imagine the 4x-widened K/V projections can tolerate more aggressive pruning; a rough sketch of what I mean follows below.
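Something like one round of magnitude pruning per projection, with the sparsity as a per-matrix knob (the per-projection rates below are made up by me, not taken from the paper):

```python
import torch

def magnitude_masks(proj, sparsity=None):
    """One round of magnitude pruning over the attention projections:
    keep the largest-magnitude weights in each matrix, allowing a
    different pruning rate for Q, K, V, and O."""
    sparsity = sparsity or {"q": 0.5, "k": 0.8, "v": 0.8, "o": 0.5}
    masks = {}
    for name, layer in proj.items():
        w = layer.weight.detach().abs()
        n_keep = max(1, int((1.0 - sparsity[name]) * w.numel()))
        threshold = torch.topk(w.flatten(), n_keep).values.min()
        masks[name] = (w >= threshold).float()
    return masks
```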
Phase 3: Sparse Retraining
- Rewind to initial (scaled) weights
- Apply discovered masks
- Retrain with masks frozen
Question:
Do you find that the sparse network needs the same number of training steps, or can it converge faster? (My rough picture of this phase is sketched below.)
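That is, I assume the masks become fixed multiplicative gates on the rewound weights, re-applied after every optimizer step. Again, just my sketch of what I think you mean:

```python
import torch

def rewind_and_mask(proj, init_state, masks):
    """Rewind each projection to its saved (scaled) initial weights, zero out
    the pruned entries, and return a hook that re-zeros them after every
    optimizer step so the discovered mask stays frozen during retraining."""
    proj.load_state_dict(init_state)                # rewind to initialization
    def reapply_masks():                            # call right after optimizer.step()
        with torch.no_grad():
            for name, layer in proj.items():
                layer.weight.mul_(masks[name])
    reapply_masks()                                 # apply the masks once up front
    return reapply_masks
```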
The part I'm most curious about is how this interacts with multi-head attention specifically.
Does each head discover its own lottery ticket independently?
Or do you need to maintain some kind of head-wise balance?
Also wondering about the computational trade-offs during training — obviously the overparameterized phase is more expensive, but if the lottery ticket is strong enough, does the sparse retraining compensate?
What's the total training cost compared to just training a standard dense model?
And one more practical question:
For the 4x overparameterization, is that dimension increase applied before or after the head split?
Like, if you have 12 heads of 64 dims each (768 total), do you go to 768×4 = 3072 before splitting into heads, or 64×4 = 256 per head? (The arithmetic I have in mind is spelled out below.)
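With 12 heads the two readings happen to give the same numbers, which is partly why I'm asking:

```python
d_model, n_heads, expand = 768, 12, 4

# Reading A: widen the full projection width, then split into heads
d_inner_a  = d_model * expand            # 768 * 4 = 3072
head_dim_a = d_inner_a // n_heads        # 3072 / 12 = 256 per head

# Reading B: widen each head after the split
head_dim_b = (d_model // n_heads) * expand   # 64 * 4 = 256 per head
d_inner_b  = head_dim_b * n_heads            # 256 * 12 = 3072 total

assert (d_inner_a, head_dim_a) == (d_inner_b, head_dim_b)  # identical with a fixed head count
```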
The reason I'm digging into this is that I think your approach could be a game-changer for how we think about training large models:
Train big, compress smart, deploy efficiently — but the devil's in the details of making it actually work.
Looking forward to your thoughts, and seriously, keep pushing this work forward!
Cheers,
Ujjwal Tyagi
AI Researcher & Scientist, Shirova AI