arXiv:2511.04217

The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

Published on Nov 6
· Submitted by Hikari Otsuka on Nov 7

Abstract

Theoretical analysis proves the existence of strong lottery tickets within multi-head attention mechanisms and extends the strong lottery ticket hypothesis to transformers without normalization layers.

AI-generated summary

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA with H heads and input dimension d has hidden dimension O(d log(H d^{3/2})) for the keys and values, then with high probability it contains an SLT that approximates an arbitrary MHA with the same input dimension. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (an MHA or a transformer) and the target model it approximates decreases exponentially as the hidden dimension of the source model increases.
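
To make the setting concrete, here is a minimal sketch of what "an SLT hidden in a randomly initialized MHA" means, based only on the summary above: the weights stay frozen at their random values and only a binary mask is chosen. This is not the paper's code; all names and shapes are illustrative, and a single head is shown for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, d_hidden = 64, 256  # input dim and key/value hidden dim (example values only)

# Randomly initialized source weights; in the SLTH setting these are never trained,
# only masked. A single head is shown; the paper's analysis covers H heads.
W_q, W_k, W_v = (torch.randn(d, d_hidden) / d ** 0.5 for _ in range(3))
W_o = torch.randn(d_hidden, d) / d_hidden ** 0.5

def masked_attention(x, masks):
    """x: (seq_len, d); masks: binary tensors with the same shape as each weight."""
    q = x @ (W_q * masks["q"])
    k = x @ (W_k * masks["k"])
    v = x @ (W_v * masks["v"])
    attn = F.softmax(q @ k.T / d_hidden ** 0.5, dim=-1)
    return (attn @ v) @ (W_o * masks["o"])

# A strong lottery ticket is a choice of `masks` for which masked_attention(x, masks)
# stays close to a given target MHA's output, even though the weights were never trained.
```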

Community

Paper author and submitter:

We extend the strong lottery ticket hypothesis to attention mechanisms and transformers.


Hey,

Your paper on SLTH for attention mechanisms is brilliant. Seriously. I've been following lottery ticket research for a while, but applying it specifically to multi-head attention with that level of theoretical depth? That's fresh.

What I love most is that this isn't just theoretical elegance — it's actionable. But I'm trying to wrap my head around the full training procedure, and I'd love your insight.

From my understanding, the training flow would look something like this (I've added a rough code sketch under each phase so you can correct me where I'm off):


Phase 1: Overparameterized Pre-training

  • Start with 4x key dimensions, 4x value dimensions
  • Use your n^(1/4) initialization scaling
  • Train on standard language modeling objective (next token prediction)

Question:
Do you use any special attention dropout or regularization here?
I'm wondering if the overparameterization needs stabilization techniques.
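
To be concrete, Phase 1 in my head looks roughly like the module below. The class name, the 4x widening of the key/value projections, and my reading of the n^(1/4) scaling are all my own guesses, not your implementation:

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
kv_dim = 4 * d_model                         # 4x key/value hidden dimension

class OverparamMHA(nn.Module):
    def __init__(self):
        super().__init__()
        self.W_q = nn.Linear(d_model, kv_dim, bias=False)
        self.W_k = nn.Linear(d_model, kv_dim, bias=False)
        self.W_v = nn.Linear(d_model, kv_dim, bias=False)
        self.W_o = nn.Linear(kv_dim, d_model, bias=False)
        for lin in (self.W_q, self.W_k, self.W_v, self.W_o):
            fan_in = lin.in_features
            # my guess at the scaling: an extra fan_in**(-1/4) on top of 1/sqrt(fan_in)
            nn.init.normal_(lin.weight, std=fan_in ** -0.5 * fan_in ** -0.25)

    def forward(self, x):                    # x: (batch, seq, d_model)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, n_heads, -1).transpose(1, 2)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.W_o((attn @ v).transpose(1, 2).reshape(B, T, -1))

# Pre-training itself would just be the usual next-token cross-entropy loop on this module.
```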


Phase 2: Lottery Ticket Identification

  • Apply iterative magnitude pruning (IMP) to attention weight matrices
  • Calculate pruning thresholds based on weight magnitudes

Question:
Do you prune Q, K, V, and O projections at the same rate, or does each need different sparsity?
I imagine K/V with 4x dimensions can tolerate more aggressive pruning.
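
The masking step I'm picturing is plain per-projection magnitude pruning, reusing the names from my Phase 1 sketch. One round is shown, whereas IMP would repeat prune-and-retrain cycles, and the per-projection sparsities are placeholders:

```python
import torch

def magnitude_masks(mha, sparsities=None):
    """Binary masks keeping the largest-|w| weights per projection. The rates below
    are placeholders -- whether Q/K/V/O should differ is exactly my question above."""
    sparsities = sparsities or {"q": 0.90, "k": 0.95, "v": 0.95, "o": 0.90}
    masks = {}
    for name, lin in (("q", mha.W_q), ("k", mha.W_k), ("v", mha.W_v), ("o", mha.W_o)):
        w = lin.weight.detach().abs()
        keep = max(1, int((1.0 - sparsities[name]) * w.numel()))
        threshold = torch.topk(w.flatten(), keep).values.min()
        masks[name] = (w >= threshold).float()
    return masks
```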


Phase 3: Sparse Retraining

  • Rewind to initial (scaled) weights
  • Apply discovered masks
  • Retrain with masks frozen

Question:
Do you find the sparse network needs the same number of training steps, or can it converge faster?
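
And Phase 3 the way I'd implement it, again just my mental model: `loss_fn`, the optimizer, and the step count are placeholders of mine.

```python
import torch

def sparse_retrain(mha, init_state, masks, data_loader, loss_fn, steps=10_000):
    proj = lambda m: (("q", m.W_q), ("k", m.W_k), ("v", m.W_v), ("o", m.W_o))
    mha.load_state_dict(init_state)              # rewind to the initial (scaled) weights
    with torch.no_grad():                        # apply the discovered masks once
        for name, lin in proj(mha):
            lin.weight *= masks[name]
    opt = torch.optim.AdamW(mha.parameters(), lr=3e-4)
    for _, (x, y) in zip(range(steps), data_loader):
        opt.zero_grad()
        loss_fn(mha, x, y).backward()
        for name, lin in proj(mha):              # zero the gradients of pruned weights
            lin.weight.grad *= masks[name]
        opt.step()
        with torch.no_grad():                    # keep pruned weights at zero (e.g. under weight decay)
            for name, lin in proj(mha):
                lin.weight *= masks[name]
    return mha
```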


The part I'm most curious about is how this interacts with multi-head attention specifically.
Does each head discover its own lottery ticket independently?
Or do you need to maintain some kind of head-wise balance?

Also wondering about the computational trade-offs during training — obviously the overparameterized phase is more expensive, but if the lottery ticket is strong enough, does the sparse retraining compensate?
What's the total training cost compared to just training a standard dense model?

And one more practical question:
For the 4x overparameterization, is that dimension increase applied before or after the head split?

Like, if you have 12 heads with 64-dim each (768 total), do you go to 768×4 = 3072 before splitting into heads, or 64×4 = 256 per head?
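
In code, the two readings I'm contrasting (same hypothetical 12-head numbers as above):

```python
num_heads, head_dim = 12, 64
d_model = num_heads * head_dim                 # 768

# Reading A: widen the full K/V projection, then split into heads
kv_total_a = 4 * d_model                       # 3072
kv_per_head_a = kv_total_a // num_heads        # 256

# Reading B: widen each head independently
kv_per_head_b = 4 * head_dim                   # 256
kv_total_b = num_heads * kv_per_head_b         # 3072
```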


The reason I'm digging into this is that I think your approach could be a game-changer for how we think about training large models:

Train big, compress smart, deploy efficiently — but the devil's in the details of making it actually work.

Looking forward to your thoughts, and seriously, keep pushing this work forward!

Cheers,
Ujjwal Tyagi
AI Researcher & Scientist, Shirova AI
