TP Question

#113
by kyars

I'm having a difficult time understanding this quote, especially since all-gather was not referenced at all in the previous section:

"One interesting note about layer normalization in tensor parallel training - since each TP rank sees the same activations after the all-gather, the layer norm weights don't actually need an all-reduce to sync their gradients after the backward pass. They naturally stay in sync across ranks. However, for dropout operations, we must make sure to sync the random seed across TP ranks to maintain deterministic behavior."
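The dropout part I think I can follow; here is a toy single-process sketch of my own (not the book's code) of why the seed has to match when dropout is applied to an activation that is replicated on every TP rank:

```python
# Toy illustration (mine, not from the book): dropout on a replicated activation
# must draw the same mask on every TP rank, otherwise the "identical" activations
# silently diverge between ranks.
import torch
import torch.nn.functional as F

x = torch.ones(4, 8)  # a replicated activation, identical on every TP rank

# Unsynced seeds: each "rank" draws a different dropout mask -> outputs diverge.
torch.manual_seed(0); rank0_out = F.dropout(x, p=0.5, training=True)
torch.manual_seed(1); rank1_out = F.dropout(x, p=0.5, training=True)
print(torch.equal(rank0_out, rank1_out))  # almost certainly False

# Synced seed: both ranks draw the same mask -> activations stay identical.
torch.manual_seed(42); rank0_out = F.dropout(x, p=0.5, training=True)
torch.manual_seed(42); rank1_out = F.dropout(x, p=0.5, training=True)
print(torch.equal(rank0_out, rank1_out))  # True
```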

This is what GPT says, but I still don't get it:
"LayerNorm weights don’t need gradient sync: after TP’s final all-gather you have identical activations on every rank, so each rank’s LayerNorm sees the same inputs and computes the same gradient for its (shared) scale & shift parameters—no extra all-reduce needed."

[attached image: embedding figure]
Can you also please explain this embedding image? There's no context for it.
