TP self attention figure

#120
by lorenzocc - opened

First of all, thank you for this awesome blog post and all the knowledge sharing! This is an amazing piece of work by the HF team!!

I'd like to point out an error in this image and suggest a text improvement:
https://nanotron-ultrascale-playbook.static.hf.space/assets/images/tp_full_diagram.png

Assuming the picture represents GPUs processing different attention heads, the final step should be an all-gather to concatenate the attention results, rather than an all-reduce.
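
To make the distinction concrete, here is a minimal PyTorch sketch (assuming `torch.distributed` is already initialized, e.g. via `torchrun`, and each rank holds `num_heads // tp_size` whole heads): combining the per-head outputs into the full hidden dimension is a concatenation across ranks, i.e. an all-gather; an all-reduce would sum the shards instead.

```python
import torch
import torch.distributed as dist

def combine_head_outputs(local_out: torch.Tensor) -> torch.Tensor:
    """local_out: [batch, seq, local_heads * head_dim] computed on this rank.

    Concatenating per-head results across ranks is an all-gather along the
    hidden dimension; an all-reduce would *sum* the shards, which only makes
    sense for partial sums (e.g. after a row-parallel output projection).
    """
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(shards, local_out)
    return torch.cat(shards, dim=-1)  # [batch, seq, num_heads * head_dim]
```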

Also, I think the text could be a bit clearer. When I first read "Query (Q), Key (K), and Value (V) matrices are split in a column-parallel fashion", I thought you were splitting one attention head across multiple GPUs. Doing that would actually require a totally different implementation. A small shape example of what I eventually understood is below.
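
For illustration, here is a hypothetical shape example of how I now read the column-parallel split (whole heads per GPU, not a slice of a single head); the sizes are made up:

```python
import torch

# Hypothetical sizes, just for illustration.
hidden_size, num_heads, tp_size = 4096, 32, 4
head_dim = hidden_size // num_heads        # 128
local_heads = num_heads // tp_size         # 8 whole heads per GPU

# Column-parallel Q projection: each rank keeps the output columns that
# correspond to its own heads, so it can run standard attention on those
# heads locally, without communication inside a head.
w_q_full = torch.randn(hidden_size, num_heads * head_dim)
w_q_rank0 = w_q_full[:, : local_heads * head_dim]   # shape [4096, 1024]

# Sharding a single head across GPUs would instead split head_dim itself,
# which needs a different attention implementation.
```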
