Fix description of Zero-1
#93
by
Guanghua
- opened
- src/index.html +5 -4
src/index.html
CHANGED
@@ -896,17 +896,16 @@
|
|
896 |
<ul>
|
897 |
<li>Forward pass with the same, full set of bf16 parameters on each replica, but different microbatches across replicas</li>
|
898 |
<li>Backward pass with the same, full set of gradients on each replica, but different microbatches across replicas</li>
|
899 |
-
<li>Perform an reduce
|
900 |
<li>Each replica performs an optimizer step on its local optimizer states (only <d-math>\frac{1}{N_d}</d-math> of the optimizer states) to get <d-math>\frac{1}{N_d}</d-math> of the updated fp32 parameters, which can then be converted to <d-math>\frac{1}{N_d}</d-math> of the full set of bf16 parameters.</li>
|
901 |
<li>Perform an all-gather among the bf16 parameters to send missing slices back to each replica. This is a new operation in ZeRO, and not used in vanilla DP.</li>
|
902 |
</ul>
|
903 |
-
<aside>Note: reduce-scatter is 2 times faster than all reduce! <em>Yay, a third communication primitive!</em></aside>
|
904 |
|
905 |
-
<p>
|
906 |
|
907 |
<p><img alt="dp_zero1.gif" src="/assets/images/dp_zero1.gif" /></p>
|
908 |
|
909 |
-
<p>In terms of practical communications, compared to vanilla DP, Zero-1
|
910 |
|
911 |
<p><img alt="dp_zero1_overlap.svg" src="/assets/images/dp_zero1_overlap.svg" /></p>
|
912 |
|
@@ -940,6 +939,8 @@
|
|
940 |
|
941 |
<p><img alt="dp_zero2_overlap.svg" src="/assets/images/dp_zero2_overlap.svg" /></p>
|
942 |
|
|
|
|
|
943 |
<aside>Note: You might notice that there is no real overhead of using ZeRO-2 over ZeRO-1 and indeed ZeRO-2 is usually the best option.</aside>
|
944 |
|
945 |
<p>Now that we’ve sharded gradients as well, are we done or can we keep getting away with this? Well, sort of. Here comes ZeRO-3!</p>
|
|
|
896 |
<ul>
|
897 |
<li>Forward pass with the same, full set of bf16 parameters on each replica, but different microbatches across replicas</li>
|
898 |
<li>Backward pass with the same, full set of gradients on each replica, but different microbatches across replicas</li>
|
899 |
+
<li>Perform an all-reduce on the gradients (we'll explain the all-reduce primitive in the graph below)</li>
|
900 |
<li>Each replica performs an optimizer step on its local optimizer states (only <d-math>\frac{1}{N_d}</d-math> of the optimizer states) to get <d-math>\frac{1}{N_d}</d-math> of the updated fp32 parameters, which can then be converted to <d-math>\frac{1}{N_d}</d-math> of the full set of bf16 parameters.</li>
|
901 |
<li>Perform an all-gather among the bf16 parameters to send missing slices back to each replica. This is a new operation in ZeRO, and not used in vanilla DP.</li>
|
902 |
</ul>
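<p>The steps listed above can be sketched as a toy, single-process simulation. This is only an illustration under stated assumptions: there is no real communication, the optimizer is plain SGD with momentum (so the momentum buffer plays the role of the sharded optimizer state), the parameter count is divisible by the number of replicas, and every function name (<code>all_reduce_mean</code>, <code>shard_bounds</code>, <code>zero1_step</code>) is hypothetical rather than taken from any library.</p>

```python
# Toy simulation of one ZeRO-1 step on N_d replicas (no real communication).
# All helper names are hypothetical; the optimizer is assumed to be SGD with momentum.

def all_reduce_mean(grads_per_replica):
    """Every replica ends up with the element-wise mean of all gradients."""
    n = len(grads_per_replica)
    mean = [sum(vals) / n for vals in zip(*grads_per_replica)]
    return [list(mean) for _ in range(n)]


def shard_bounds(num_params, n_d, rank):
    """Contiguous parameter slice owned by `rank` (assumes num_params % n_d == 0)."""
    size = num_params // n_d
    return rank * size, (rank + 1) * size


def zero1_step(params, grads_per_replica, momentum_shards, lr=0.1, beta=0.9):
    n_d = len(grads_per_replica)
    # Step 3 of the list: all-reduce, so each replica holds the full averaged gradient.
    reduced = all_reduce_mean(grads_per_replica)
    updated_shards = []
    for rank in range(n_d):
        lo, hi = shard_bounds(len(params), n_d, rank)
        # Step 4: each replica only updates the 1/N_d slice whose optimizer state it holds.
        new_shard = []
        for i in range(lo, hi):
            m = beta * momentum_shards[rank][i - lo] + reduced[rank][i]
            momentum_shards[rank][i - lo] = m
            new_shard.append(params[i] - lr * m)
        updated_shards.append(new_shard)
    # Step 5: all-gather the updated slices so every replica has the full parameters again.
    full = [p for shard in updated_shards for p in shard]
    return [list(full) for _ in range(n_d)]


# Two replicas, four parameters, different microbatch gradients per replica.
params = [1.0, 2.0, 3.0, 4.0]
grads = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.2, 0.1, 0.0]]
momentum = [[0.0, 0.0], [0.0, 0.0]]  # each replica stores only half of the optimizer state
replicas = zero1_step(params, grads, momentum)
```

<p>Note that each replica stores only its own momentum slice, yet after the final gather all replicas hold identical full parameters, which is the memory saving ZeRO-1 is after.</p>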
|
|
|
903 |
|
904 |
+
<p>The figure below gives a more graphical representation of these steps. We'll go over all the steps of a forward/backward pass cycle:</p>
|
905 |
|
906 |
<p><img alt="dp_zero1.gif" src="/assets/images/dp_zero1.gif" /></p>
|
907 |
|
908 |
+
<p>In terms of practical communications, compared to vanilla DP, ZeRO-1 still uses an all-reduce for gradient communication, but each replica updates only the parameters corresponding to its optimizer state shard, and an all-gather operation over all parameters is added after the optimizer step. Here is how it looks:</p>
|
909 |
|
910 |
<p><img alt="dp_zero1_overlap.svg" src="/assets/images/dp_zero1_overlap.svg" /></p>
|
911 |
|
|
|
939 |
|
940 |
<p><img alt="dp_zero2_overlap.svg" src="/assets/images/dp_zero2_overlap.svg" /></p>
|
941 |
|
942 |
+
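<aside>Note: in terms of data moved, reduce-scatter is roughly 2 times faster than all-reduce, since an all-reduce can be implemented as a reduce-scatter followed by an all-gather! <em>Yay, a third communication primitive!</em></aside>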
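<p>To see where the factor of 2 in this note can come from, here is a small sketch that builds an all-reduce out of a reduce-scatter plus an all-gather and counts elements sent per rank. It assumes a ring implementation and a bandwidth-only cost model that ignores latency; the function names, including <code>ring_elements_sent</code>, are hypothetical and do not correspond to any library API.</p>

```python
# Toy collectives on Python lists; one inner list per rank. Names are hypothetical.

def reduce_scatter(buffers):
    """Rank r ends up with the sum of chunk r only (assumes length divisible by #ranks)."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    return [
        [sum(buf[i] for buf in buffers) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(n)
    ]


def all_gather(chunks):
    """Every rank ends up with the concatenation of all chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]


def all_reduce(buffers):
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(buffers))


def ring_elements_sent(num_elements, n, op):
    """Elements each rank sends under an assumed ring cost model."""
    per_phase = (n - 1) * num_elements // n
    if op in ("reduce_scatter", "all_gather"):
        return per_phase
    if op == "all_reduce":
        return 2 * per_phase  # one reduce-scatter phase plus one all-gather phase
    raise ValueError(f"unknown op: {op}")


buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]
scattered = reduce_scatter(buffers)
reduced = all_reduce(buffers)
ratio = ring_elements_sent(8, 4, "all_reduce") / ring_elements_sent(8, 4, "reduce_scatter")
```

<p>Under these assumptions the all-reduce moves twice as many elements per rank as the reduce-scatter alone; real speedups depend on the network, message sizes, and the collective algorithm actually used.</p>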
|
943 |
+
|
944 |
<aside>Note: You might notice that there is no real overhead of using ZeRO-2 over ZeRO-1 and indeed ZeRO-2 is usually the best option.</aside>
|
945 |
|
946 |
<p>Now that we’ve sharded gradients as well, are we done or can we keep getting away with this? Well, sort of. Here comes ZeRO-3!</p>
|