nanotron/ultrascale-playbook

Fix description of Zero-1

#93

by Guanghua - opened Mar 2

base: refs/heads/main ← from: refs/pr/93

This PR is in draft mode

Files changed (1)

src/index.html +5 -4

@@ -896,17 +896,16 @@
         <ul>
             <li>Forward pass with the same, full set of bf16 parameters on each replica, but different microbatches across replicas</li>
             <li>Backward pass with the same, full set of gradients on each replica, but different microbatches across replicas</li>
-            <li>Perform an reduce-scatter on the gradients (we'll explain the reduce-scatter primitive in the graph below)</li>
             <li>Each replica perform an optimizer step on its local optimizer steps (only <d-math>\frac{1}{N_d}</d-math> optimizer states) to get updated <d-math>\frac{1}{N_d}</d-math> fp32 parameters which can then be converted to <d-math>\frac{1}{N_d}</d-math> of the full set of bf16 parameters.</li>
             <li>Perform an all-gather among the bf16 parameters to send missing slices back to each replica. This is a new operation in ZeRO, and not used in vanilla DP.</li>
         </ul>
-        <aside>Note: reduce-scatter is 2 times faster than all reduce! <em>Yay, a third communication primitive!</em></aside>
-        <p>You may be wondering what is this "reduce-scatter" operation and how this all look so lets try to make this more graphical with the figure below. We'll go over all the steps of a forward/backward pass cycle:</p>
         <p><img alt="dp_zero1.gif" src="/assets/images/dp_zero1.gif" /></p>
-        <p>In terms of practical communications, compared to vanilla DP, Zero-1 change our "all-reduce" gradient communication to a "reduce-scatter" operation and adds an all-gather operation over all parameters after the optimizer step. Here is how it looks:</p>
         <p><img alt="dp_zero1_overlap.svg" src="/assets/images/dp_zero1_overlap.svg" /></p>
@@ -940,6 +939,8 @@
         <p><img alt="dp_zero2_overlap.svg" src="/assets/images/dp_zero2_overlap.svg" /></p>
         <aside>Note: You might notice that there is no real overhead of using ZeRO-2 over ZeRO-1 and indeed ZeRO-2 is usually the best option.</aside>
         <p>Now that we’ve sharded gradients as well, are we done or can we keep getting away with this? Well, sort of. Here comes ZeRO-3!</p>

         <ul>
             <li>Forward pass with the same, full set of bf16 parameters on each replica, but different microbatches across replicas</li>
             <li>Backward pass with the same, full set of gradients on each replica, but different microbatches across replicas</li>
+            <li>Perform an all-reduce on the gradients (we'll explain the all-reduce primitive in the graph below)</li>
             <li>Each replica perform an optimizer step on its local optimizer steps (only <d-math>\frac{1}{N_d}</d-math> optimizer states) to get updated <d-math>\frac{1}{N_d}</d-math> fp32 parameters which can then be converted to <d-math>\frac{1}{N_d}</d-math> of the full set of bf16 parameters.</li>
             <li>Perform an all-gather among the bf16 parameters to send missing slices back to each replica. This is a new operation in ZeRO, and not used in vanilla DP.</li>
         </ul>
+        <p>Here is a more graphical representation of the above steps with the figure below. We'll go over all the steps of a forward/backward pass cycle:</p>
         <p><img alt="dp_zero1.gif" src="/assets/images/dp_zero1.gif" /></p>
+        <p>In terms of practical communications, compared to vanilla DP, Zero-1 uses "all-reduce" for gradient communication, updates parameters corresponding to the optimizer state shard, and adds an all-gather operation over all parameters after the optimizer step. Here is how it looks:</p>
         <p><img alt="dp_zero1_overlap.svg" src="/assets/images/dp_zero1_overlap.svg" /></p>
         <p><img alt="dp_zero2_overlap.svg" src="/assets/images/dp_zero2_overlap.svg" /></p>
+        <aside>Note: reduce-scatter is 2 times faster than all-reduce! <em>Yay, a third communication primitive!</em></aside>
         <aside>Note: You might notice that there is no real overhead of using ZeRO-2 over ZeRO-1 and indeed ZeRO-2 is usually the best option.</aside>
         <p>Now that we’ve sharded gradients as well, are we done or can we keep getting away with this? Well, sort of. Here comes ZeRO-3!</p>
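
The ZeRO-1 steps in the revised list (all-reduce the gradients, update only the local 1/N_d shard, then all-gather the parameters) can be sketched as a toy, single-process simulation. This is only an illustration under stated assumptions, not the playbook's or any library's implementation: plain SGD stands in for the real optimizer, Python lists stand in for tensors, "ranks" are list indices, and the helper names (`all_reduce_mean`, `shard_bounds`, `zero1_step`, `vanilla_dp_step`) are hypothetical.

```python
# Toy simulation of one ZeRO-1 step as described in the revised text.
# Assumptions (not from the source): plain SGD instead of a stateful optimizer,
# lists instead of tensors, and the parameter count divisible by the rank count.


def all_reduce_mean(per_rank_grads):
    """Every rank ends up with the element-wise mean of all ranks' gradients."""
    n_ranks = len(per_rank_grads)
    mean = [sum(vals) / n_ranks for vals in zip(*per_rank_grads)]
    return [list(mean) for _ in range(n_ranks)]


def shard_bounds(rank, n_ranks, n_params):
    """Contiguous [start, end) slice of the parameters owned by `rank`."""
    size = n_params // n_ranks
    return rank * size, (rank + 1) * size


def zero1_step(params, per_rank_grads, lr):
    """Return the full parameter list held by each rank after one step."""
    n_ranks, n_params = len(per_rank_grads), len(params)

    # Step 1: all-reduce, so every rank holds the full averaged gradient.
    reduced = all_reduce_mean(per_rank_grads)

    # Step 2: each rank updates only the shard matching its optimizer states.
    updated_shards = []
    for rank in range(n_ranks):
        start, end = shard_bounds(rank, n_ranks, n_params)
        shard = [
            p - lr * g
            for p, g in zip(params[start:end], reduced[rank][start:end])
        ]
        updated_shards.append(shard)

    # Step 3: all-gather, so every rank receives the slices it did not update.
    full = [p for shard in updated_shards for p in shard]
    return [list(full) for _ in range(n_ranks)]


def vanilla_dp_step(params, per_rank_grads, lr):
    """Reference: every rank updates every parameter after the all-reduce."""
    reduced = all_reduce_mean(per_rank_grads)
    return [
        [p - lr * g for p, g in zip(params, reduced[rank])]
        for rank in range(len(per_rank_grads))
    ]


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0]
    grads = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]  # two ranks
    print(zero1_step(params, grads, lr=0.5))
```

In this sketch both functions produce the same parameters on every rank. The difference is that `zero1_step` only touches 1/N_d of the parameters per rank during the update, which is why only 1/N_d of the optimizer states would need to live on each rank.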