nouamanetazi (HF staff) committed
Commit cb40edf · 1 Parent(s): bc52030
assets/images/5Dparallelism_8Bmemoryusage.svg CHANGED
dist/assets/images/5Dparallelism_8Bmemoryusage.svg CHANGED
dist/index.html CHANGED
@@ -286,7 +286,7 @@
286
 
287
  <p>So how can I quickly determine memory usage from these variables? One simple way is to do this empirically and just measure it.</p>
288
 
289
- <h4>Memory profiling a training step</h4>
290
 
291
  <p>Using this snippet [TODO: link to appendix A5], we can understand how memory is allocated throughout training. We can see that memory utilization is not a static thing but varies a lot during training and during a training step:</p>
292
 
@@ -315,7 +315,7 @@
315
  N = h * v + L * (12 * h^2 + 13 * h) + 2*h
316
  </d-math>
317
 
318
- <aside>We excluded the positional embedding count as rotary embeddings are not learned.</aside>
319
 
320
  <p>In that equation, <d-math>h</d-math> is the hidden dimension, <d-math>v</d-math> the vocabulary size, and <d-math>L</d-math> the number of layers in the model. Note that looking at the equation we can see that the term that will dominate at large hidden dimensions is the <d-math>h^2</d-math> term since it’s the only one growing quadratically as we scale the parameters.</p>
321
 
@@ -348,7 +348,13 @@
348
  <p class="note-box-title">📝 Note</p>
349
  <p class="note-box-content">
350
  Some libraries store grads in fp32 which would require an additional <d-math>m_{params\_fp32} = 4 * N</d-math> memory. This is done for example in nanotron, because <code>bf16</code> is lossy for smaller values and we always prioritize stability. See <a href="https://github.com/microsoft/DeepSpeed/issues/1773">this DeepSpeed issue</a> for more information.
351
 
352
  </p>
353
  </div>
354
 
@@ -400,7 +406,7 @@
400
  <p>Activation memory is a bit more complex to compute than the weights, gradients and optimizer states, in part because it depends on the inputs of the model. If you’re unsure why we even need to store activations for the backward pass, <a href="https://www.determined.ai/blog/act-mem-2">this reference</a> is a good quick refresher. After a careful inspection of how the backward pass is computed, we can estimate the total memory required for the activations in mixed precision and arrive at the following equation:</p>
401
 
402
  <d-math block>
403
- m_{act} = L<em> seq * bs * h * (34 + \frac{5</em>n_{heads}*seq}{h})</p>
404
  </d-math>
405
 
406
  <p>Here <d-math>L</d-math> is the number of layers, <d-math>seq</d-math> the sequence length, <d-math>bs</d-math> the batch size in samples, <d-math>h</d-math> the hidden dimension of the model and <d-math>n_{heads}</d-math> the number of heads.</p>
@@ -500,9 +506,48 @@
500
 
501
  <p>But if you’ve carefully followed, you probably noticed that the forward/backward passes for each micro-batch can actually be run in parallel. Forward/backward passes are independent from each other, with independent input samples being the only difference. Seems like it’s time to start extending our training to more than one GPU! </p>
502
 
503
- <p>Let’s get a larger workstation 🖥️ with a couple of GPUs and start investigating our first scaling technique called <em><strong>data parallelism</strong> which is just a parallel version of gradient accumulation</em>.</p>
504
 
505
- <p><strong>TODO: add profiling here or not?</strong></p>
506
 
507
  <h2>Data Parallelism</h2>
508
 
@@ -1584,7 +1629,7 @@
1584
 
1585
  <p>And to have an idea of the memory benefits of each parallelism:</p>
1586
 
1587
- <p><img alt="image.png" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" /></p>
1588
 
1589
  <h2>How to Find the Best Training Configuration</h2>
1590
 
 
286
 
287
  <p>So how can I quickly determine memory usage from these variables? One simple way is to do this empirically and just measure it.</p>
288
 
289
+ <h4>Profiling the memory usage</h4>
290
 
291
  <p>Using this snippet [TODO: link to appendix A5], we can understand how memory is allocated throughout training. We can see that memory utilization is not a static thing but varies a lot during training and during a training step:</p>
292
 
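<p>As a placeholder until that appendix is linked, here is a minimal sketch of such a memory-profiling snippet, using PyTorch's built-in CUDA memory-history recorder and a hypothetical <code>train_step()</code> function (the actual appendix code may differ):</p>

<d-code block language="python">
import torch

# Record every CUDA allocation/free together with the Python stack that triggered it
torch.cuda.memory._record_memory_history(max_entries=100_000)

for step in range(3):   # a few steps are enough to see the within-step pattern
    train_step()        # hypothetical training step: forward + backward + optimizer

# Dump a snapshot that can be dropped into https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
</d-code>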
 
315
  N = h * v + L * (12 * h^2 + 13 * h) + 2*h
316
  </d-math>
317
 
318
+ <aside>We excluded the positional embedding count as we're not using fixed positional embeddings.</aside>
319
 
320
  <p>In that equation, <d-math>h</d-math> is the hidden dimension, <d-math>v</d-math> the vocabulary size, and <d-math>L</d-math> the number of layers in the model. Note that looking at the equation we can see that the term that will dominate at large hidden dimensions is the <d-math>h^2</d-math> term since it’s the only one growing quadratically as we scale the parameters.</p>
321
 
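<p>As a quick sanity check of this formula, here is a small sketch that plugs in illustrative dimensions (the values below are assumptions for the example, not taken from a specific model):</p>

<d-code block language="python">
h, v, L = 4096, 128_000, 32   # illustrative hidden size, vocabulary size, layer count

# N = h*v + L*(12*h^2 + 13*h) + 2*h
N = h * v + L * (12 * h**2 + 13 * h) + 2 * h
print(f"{N:,} parameters (~{N / 1e9:.1f}B)")   # 6,968,451,072 parameters (~7.0B)
</d-code>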
 
348
  <p class="note-box-title">📝 Note</p>
349
  <p class="note-box-content">
350
  Some libraries store grads in fp32 which would require an additional <d-math>m_{params\_fp32} = 4 * N</d-math> memory. This is done for example in nanotron, because <code>bf16</code> is lossy for smaller values and we always prioritize stability. See <a href="https://github.com/microsoft/DeepSpeed/issues/1773">this DeepSpeed issue</a> for more information.
351
+ </p>
352
+ </div>
353
 
354
+ <div class="note-box">
355
+ <p class="note-box-title">📝 Note</p>
356
+ <p class="note-box-content">
357
+ The FP32 copy of parameters (<d-math>m_{params\_fp32}</d-math>) is sometimes called "master weights" in the literature and codebases.
358
  </p>
359
  </div>
360
 
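<p>Putting these pieces together, here is a rough sketch of the per-parameter accounting under this kind of mixed-precision setup (bf16 parameters and gradients, an fp32 copy of the parameters, and two fp32 Adam moments); the exact breakdown depends on the implementation:</p>

<d-code block language="python">
def model_state_memory_gb(N: int, fp32_grads: bool = False) -> float:
    """Approximate memory (GB) for parameters, gradients, master weights and Adam states."""
    bytes_per_param = (
        2                            # bf16 parameters
        + (4 if fp32_grads else 2)   # gradients (fp32 in e.g. nanotron, otherwise bf16)
        + 4                          # fp32 master weights
        + 4 + 4                      # fp32 Adam first and second moments
    )
    return bytes_per_param * N / 1e9

print(model_state_memory_gb(8_000_000_000))                   # 128.0
print(model_state_memory_gb(8_000_000_000, fp32_grads=True))  # 144.0
</d-code>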
 
406
  <p>Activation memory is a bit more complex to compute than the weights, gradients and optimizer states, in part because it depends on the inputs of the model. If you’re unsure why we even need to store activations for the backward pass, <a href="https://www.determined.ai/blog/act-mem-2">this reference</a> is a good quick refresher. After a careful inspection of how the backward pass is computed, we can estimate the total memory required for the activations in mixed precision and arrive at the following equation:</p>
407
 
408
  <d-math block>
409
+ m_{act} = L \cdot seq \cdot bs \cdot h \cdot (34 + \frac{5 \cdot n_{heads} \cdot seq}{h})
410
  </d-math>
411
 
412
  <p>Here <d-math>L</d-math> is the number of layers, <d-math>seq</d-math> the sequence length, <d-math>bs</d-math> the batch size in samples, <d-math>h</d-math> the hidden dimension of the model and <d-math>n_{heads}</d-math> the number of heads.</p>
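<p>To get a feel for the magnitude, here is a small sketch that evaluates the formula (which yields bytes) for illustrative dimensions; the values are assumptions for the example:</p>

<d-code block language="python">
L, seq, bs, h, n_heads = 32, 4096, 1, 4096, 32   # illustrative model and batch dimensions

# m_act = L * seq * bs * h * (34 + 5 * n_heads * seq / h), in bytes
m_act = L * seq * bs * h * (34 + 5 * n_heads * seq / h)
print(f"~{m_act / 1e9:.0f} GB of activations")   # ~104 GB for a single 4096-token sample
</d-code>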
 
506
 
507
  <p>But if you’ve carefully followed, you probably noticed that the forward/backward passes for each micro-batch can actually be run in parallel. Forward/backward passes are independent from each other, with independent input samples being the only difference. Seems like it’s time to start extending our training to more than one GPU! </p>
508
 
509
+ <h4>Profiling GPU compute and communication</h4>
510
+
511
+ <p>But before that, we need another tool (and probably the most useful one) in our distributed training toolbox, one that lets us understand and validate both the communication between GPUs and the compute happening on each of them.</p>
512
+
513
+ <p>PyTorch's <a href="https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html">profiler</a> allows us to trace and visualize exactly what's happening on both CPU and GPU during training. Let's see how to use it:</p>
514
+
515
+ <d-code block language="python">
516
+ with torch.profiler.profile(
517
+ activities=[
518
+ torch.profiler.ProfilerActivity.CPU,
519
+ torch.profiler.ProfilerActivity.CUDA,
520
+ ],
521
+ schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
522
+ on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/profile'),
523
+ with_stack=True
524
+ ) as prof:
525
+ for step in range(steps):
526
+ train_step()
527
+ prof.step()</d-code>
528
+
529
+ <p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
530
 
531
+ <ul>
532
+ <li>CPU thread launching kernels asynchronously to GPU</li>
533
+ <li>Multiple CUDA streams handling compute and communication in parallel</li>
534
+ <li>Kernel execution times and memory allocation</li>
535
+ </ul>
536
+
537
+ <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
538
+ <p>Figure: Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p>
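<p>If TensorBoard is not at hand, the same kind of trace can be dumped as a Chrome-trace JSON and opened in <code>chrome://tracing</code> or Perfetto. A minimal sketch, again assuming a hypothetical <code>train_step()</code>:</p>

<d-code block language="python">
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
) as prof:
    train_step()   # hypothetical single training step

prof.export_chrome_trace("trace.json")   # open in chrome://tracing or ui.perfetto.dev
</d-code>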
539
+
540
+ <p>The trace helps identify bottlenecks like:</p>
541
+ <ul>
542
+ <li>Sequential compute and communication that could be overlapped</li>
543
+ <li>Idle GPU time waiting for data transfers</li>
544
+ <li>Memory movement between CPU and GPU</li>
545
+ <li>Kernel launch overhead from CPU</li>
546
+ </ul>
547
+
548
+ <p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show whether gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
549
+
550
+ <p>Let’s get a larger workstation 🖥️ with a couple of GPUs and start investigating our first scaling technique called <em><strong>data parallelism</strong>, which is just a parallel version of gradient accumulation</em>.</p>
551
 
552
  <h2>Data Parallelism</h2>
553
 
 
1629
 
1630
  <p>And to have an idea of the memory benefits of each parallelism:</p>
1631
 
1632
+ <div class="l-page"><img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" /></div>
1633
 
1634
  <h2>How to Find the Best Training Configuration</h2>
1635
 
src/index.html CHANGED
(same diff as dist/index.html above)