TP self attention figure
First of all, thank you for this awesome blog post and all the knowledge sharing! This is an amazing piece of work by the HF team!!
I'd like to point out a typo in this image and suggest a text improvement:
https://nanotron-ultrascale-playbook.static.hf.space/assets/images/tp_full_diagram.png
Assuming the picture represents GPUs processing different attention heads, the final step should be an all-gather that concatenates the attention results, rather than an all-reduce.
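Here is a minimal sketch of what I mean, assuming one attention head per GPU and an already-initialized torch.distributed process group (the names are illustrative, not nanotron's actual implementation):

```python
import torch
import torch.distributed as dist

def head_parallel_attention(x, wq, wk, wv):
    # x: [seq, hidden]; wq/wk/wv: [hidden, d_head] shards for this rank's
    # head, i.e. the column-parallel split of the full Q/K/V weights.
    q, k, v = x @ wq, x @ wk, x @ wv                      # local head only
    scores = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    local_out = scores @ v                                # [seq, d_head]

    # Each rank holds the output of *its own* head, so recovering the full
    # attention output means concatenating the shards (all-gather),
    # not summing them (all-reduce).
    shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, local_out)
    return torch.cat(shards, dim=-1)                      # [seq, num_heads * d_head]
```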
Also, I think the text could be a bit clearer. When I first read "Query (Q), Key (K), and Value (V) matrices are split in a column-parallel fashion", I thought you were splitting a single attention head across multiple GPUs, which would actually require a totally different implementation.