TP self attention figure
First of all, thank you for this awesome blog post and all the knowledge sharing! This is an amazing piece of work by the HF team!!
I'd like to point out a typo in this image and suggest a text improvement:
https://nanotron-ultrascale-playbook.static.hf.space/assets/images/tp_full_diagram.png
Assuming the picture represents GPUs processing different attention heads, the final step should be an all-gather that concatenates the attention results, rather than an all-reduce.
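Here is a minimal sketch of what I mean, assuming one attention head per GPU and an already-initialized torch.distributed process group (the names are illustrative, not nanotron's actual implementation):

```python
import torch
import torch.distributed as dist

def head_parallel_attention(x, wq, wk, wv):
    # x: [seq, hidden]; wq/wk/wv: [hidden, d_head] shards for this rank's
    # head, i.e. the column-parallel split of the full Q/K/V weights.
    q, k, v = x @ wq, x @ wk, x @ wv                      # local head only
    scores = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    local_out = scores @ v                                # [seq, d_head]

    # Each rank holds the output of *its own* head, so recovering the full
    # attention output means concatenating the shards (all-gather),
    # not summing them (all-reduce).
    shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, local_out)
    return torch.cat(shards, dim=-1)                      # [seq, num_heads * d_head]
```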
Also, I think the text could be a bit clearer. When I first read "Query (Q), Key (K), and Value (V) matrices are split in a column-parallel fashion", I thought you were splitting a single attention head across multiple GPUs, which would actually require a totally different implementation.