Guanghua committed on
Commit
296f7a4
·
1 Parent(s): 943319e

fixed the text, images are not easy to fix

Files changed (1)
  1. src/index.html +5 -4
src/index.html CHANGED
@@ -896,17 +896,16 @@
896
  <ul>
897
  <li>Forward pass with the same, full set of bf16 parameters on each replica, but different microbatches across replicas</li>
898
  <li>Backward pass with the same, full set of gradients on each replica, but different microbatches across replicas</li>
899
- <li>Perform an reduce-scatter on the gradients (we'll explain the reduce-scatter primitive in the graph below)</li>
900
  <li>Each replica performs an optimizer step on its local optimizer states (only <d-math>\frac{1}{N_d}</d-math> of the optimizer states) to get updated <d-math>\frac{1}{N_d}</d-math> fp32 parameters, which can then be converted to <d-math>\frac{1}{N_d}</d-math> of the full set of bf16 parameters.</li>
901
  <li>Perform an all-gather among the bf16 parameters to send missing slices back to each replica. This is a new operation in ZeRO, and not used in vanilla DP.</li>
902
  </ul>
903
- <aside>Note: reduce-scatter is 2 times faster than all reduce! <em>Yay, a third communication primitive!</em></aside>
904
 
905
- <p>You may be wondering what is this "reduce-scatter" operation and how this all look so lets try to make this more graphical with the figure below. We'll go over all the steps of a forward/backward pass cycle:</p>
906
 
907
  <p><img alt="dp_zero1.gif" src="/assets/images/dp_zero1.gif" /></p>
908
 
909
- <p>In terms of practical communications, compared to vanilla DP, Zero-1 change our "all-reduce" gradient communication to a "reduce-scatter" operation and adds an all-gather operation over all parameters after the optimizer step. Here is how it looks:</p>
910
 
911
  <p><img alt="dp_zero1_overlap.svg" src="/assets/images/dp_zero1_overlap.svg" /></p>
912
 
@@ -940,6 +939,8 @@
940
 
941
  <p><img alt="dp_zero2_overlap.svg" src="/assets/images/dp_zero2_overlap.svg" /></p>
942
 
 
 
943
  <aside>Note: You might notice that there is no real overhead of using ZeRO-2 over ZeRO-1 and indeed ZeRO-2 is usually the best option.</aside>
944
 
945
  <p>Now that we’ve sharded gradients as well, are we done or can we keep getting away with this? Well, sort of. Here comes ZeRO-3!</p>
 
896
  <ul>
897
  <li>Forward pass with the same, full set of bf16 parameters on each replica, but different microbatches across replicas</li>
898
  <li>Backward pass with the same, full set of gradients on each replica, but different microbatches across replicas</li>
899
+ <li>Perform an all-reduce on the gradients (we'll explain the all-reduce primitive in the graph below)</li>
900
  <li>Each replica performs an optimizer step on its local optimizer states (only <d-math>\frac{1}{N_d}</d-math> of the optimizer states) to get updated <d-math>\frac{1}{N_d}</d-math> fp32 parameters, which can then be converted to <d-math>\frac{1}{N_d}</d-math> of the full set of bf16 parameters.</li>
901
  <li>Perform an all-gather among the bf16 parameters to send missing slices back to each replica. This is a new operation in ZeRO, and not used in vanilla DP.</li>
902
  </ul>
 
903
 
904
+ <p>The figure below shows these steps graphically. We'll go over all the steps of a forward/backward pass cycle:</p>
905
 
906
  <p><img alt="dp_zero1.gif" src="/assets/images/dp_zero1.gif" /></p>
907
 
908
+ <p>In terms of practical communications, compared to vanilla DP, ZeRO-1 still uses an "all-reduce" for gradient communication, has each replica update only the parameters corresponding to its optimizer state shard, and adds an all-gather operation over all parameters after the optimizer step. Here is how it looks:</p>
909
 
910
  <p><img alt="dp_zero1_overlap.svg" src="/assets/images/dp_zero1_overlap.svg" /></p>
911
 
 
939
 
940
  <p><img alt="dp_zero2_overlap.svg" src="/assets/images/dp_zero2_overlap.svg" /></p>
941
 
942
+ <aside>Note: reduce-scatter is 2 times faster than all-reduce! <em>Yay, a third communication primitive!</em></aside>
943
+
944
  <aside>Note: You might notice that there is no real overhead of using ZeRO-2 over ZeRO-1 and indeed ZeRO-2 is usually the best option.</aside>
945
 
946
  <p>Now that we’ve sharded gradients as well, are we done or can we keep getting away with this? Well, sort of. Here comes ZeRO-3!</p>
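The ZeRO-1 steps listed in the new version of the text (all-reduce on the gradients, a sharded optimizer step, then an all-gather on the parameters) can be sketched as a toy, single-process simulation. This is only an illustration under stated assumptions, not the page's or any library's implementation: the names `N_D`, `NUM_PARAMS`, `LR`, `all_reduce_mean`, `reduce_scatter_mean`, `all_gather` and `zero1_step` are hypothetical, plain SGD stands in for the real optimizer, gradients are assumed to be averaged across replicas, and Python lists stand in for tensors and real collectives.

```python
# Toy single-process simulation of one ZeRO-1 step. No real communication
# happens; each "replica" is just an entry in a Python list.
# All names below are hypothetical and chosen for illustration only.

N_D = 4          # number of data-parallel replicas (assumed)
NUM_PARAMS = 8   # toy model size, assumed divisible by N_D
LR = 0.1         # learning rate; plain SGD stands in for the real optimizer


def all_reduce_mean(per_replica_grads):
    """Every replica ends up with the full, averaged gradient."""
    n = len(per_replica_grads)
    avg = [sum(col) / n for col in zip(*per_replica_grads)]
    return [list(avg) for _ in range(n)]


def reduce_scatter_mean(per_replica_grads):
    """Replica r ends up with only slice r of the averaged gradient."""
    n = len(per_replica_grads)
    avg = [sum(col) / n for col in zip(*per_replica_grads)]
    shard = len(avg) // n
    return [avg[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(per_replica_shards):
    """Every replica ends up with the concatenation of all shards."""
    full = [x for shard in per_replica_shards for x in shard]
    return [list(full) for _ in per_replica_shards]


def zero1_step(params, per_replica_grads):
    """One step: all-reduce grads, update the local shard, all-gather params."""
    n = len(per_replica_grads)
    shard = len(params) // n
    grads = all_reduce_mean(per_replica_grads)  # full gradients on each replica
    new_shards = []
    for r in range(n):
        lo, hi = r * shard, (r + 1) * shard
        # Replica r holds optimizer state only for its own 1/N_d slice,
        # so it can only update that slice of the parameters.
        new_shards.append(
            [p - LR * g for p, g in zip(params[lo:hi], grads[r][lo:hi])]
        )
    return all_gather(new_shards)  # send the missing slices to every replica
```

In this sketch each replica reads only its own slice of the all-reduced gradient, which is exactly the slice `reduce_scatter_mean` would have delivered. Ring-based all-reduce implementations are typically a reduce-scatter followed by an all-gather, which is the usual reasoning behind the "2 times faster" note for reduce-scatter in the diff above.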