add references

- dist/index.html  +181 -7
- src/index.html  +181 -7

dist/index.html
CHANGED
@@ -2313,18 +2313,192 @@
 <h2>References</h2>

 <h3>Landmark LLM Scaling Papers</h3>
-
+
+<div>
+<a href="https://arxiv.org/abs/1909.08053"><strong>Megatron-LM</strong></a>
+<p>Introduces tensor parallelism and efficient model parallelism techniques for training large language models.</p>
+</div>
+
+<div>
+<a href="https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/"><strong>Megatron-Turing NLG 530B</strong></a>
+<p>Describes the training of a 530B-parameter model using a combination of the DeepSpeed and Megatron-LM frameworks.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2204.02311"><strong>PaLM</strong></a>
+<p>Introduces Google's Pathways Language Model, demonstrating strong performance and reasoning capabilities across hundreds of language tasks.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2312.11805"><strong>Gemini</strong></a>
+<p>Presents Google's multimodal model architecture, capable of processing text, image, audio, and video inputs.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2412.19437v1"><strong>DeepSeek-V3</strong></a>
+<p>DeepSeek's report on the architecture and training of the DeepSeek-V3 model.</p>
+</div>
+
+
 <h3>Training Frameworks</h3>
-
+
+<div>
+<a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+<p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
+</div>
+
+<div>
+<a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
+<p>NVIDIA's framework for training large language models with model and data parallelism.</p>
+</div>
+
+<div>
+<a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
+<p>Microsoft's deep learning optimization library, featuring the ZeRO optimization stages and various parallelism techniques.</p>
+</div>
+
+<div>
+<a href="https://colossalai.org/"><strong>ColossalAI</strong></a>
+<p>Integrated large-scale model training system with various optimization techniques.</p>
+</div>
+
+<div>
+<a href="https://github.com/pytorch/torchtitan"><strong>torchtitan</strong></a>
+<p>A PyTorch-native library for large model training.</p>
+</div>
+
+<div>
+<a href="https://github.com/EleutherAI/gpt-neox"><strong>GPT-NeoX</strong></a>
+<p>EleutherAI's framework for training large language models, used to train GPT-NeoX-20B.</p>
+</div>
+
+<div>
+<a href="https://github.com/Lightning-AI/litgpt"><strong>LitGPT</strong></a>
+<p>Lightning AI's implementation of state-of-the-art open-source LLMs, with a focus on reproducibility.</p>
+</div>
+
+<div>
+<a href="https://github.com/PrimeIntellect-ai/OpenDiLoCo"><strong>DiLoCo</strong></a>
+<p>Training language models across distributed compute clusters with DiLoCo.</p>
+</div>
+
 <h3>Debugging</h3>
-
+
+<div>
+<a href="https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"><strong>Speed profiling</strong></a>
+<p>Official PyTorch tutorial on using the profiler to analyze model performance and bottlenecks.</p>
+</div>
+
+<div>
+<a href="https://pytorch.org/blog/understanding-gpu-memory-1/"><strong>Memory profiling</strong></a>
+<p>Comprehensive guide to understanding and optimizing GPU memory usage in PyTorch.</p>
+</div>
+
+<div>
+<a href="https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"><strong>TensorBoard Profiler Tutorial</strong></a>
+<p>Guide to using TensorBoard's profiling tools for PyTorch models.</p>
+</div>
+
 <h3>Distribution Techniques</h3>
-
-
-
+
+<div>
+<a href="https://siboehm.com/articles/22/data-parallel-training"><strong>Data parallelism</strong></a>
+<p>Comprehensive explanation of data-parallel training in deep learning.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/1910.02054"><strong>ZeRO</strong></a>
+<p>Introduces the Zero Redundancy Optimizer (ZeRO) for memory-efficient training of large models.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2304.11277"><strong>FSDP</strong></a>
+<p>The Fully Sharded Data Parallel training implementation in PyTorch.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2205.05198"><strong>Tensor and Sequence Parallelism + Selective Recomputation</strong></a>
+<p>Advanced techniques for efficient large-scale model training, combining different parallelism strategies.</p>
+</div>
+
+<div>
+<a href="https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/#pipeline_parallelism"><strong>Pipeline parallelism</strong></a>
+<p>NVIDIA's guide to implementing pipeline parallelism for large model training.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/2211.05953"><strong>Breadth-First Pipeline Parallelism</strong></a>
+<p>Includes a broad discussion of pipeline-parallel schedules.</p>
+</div>
+
+<div>
+<a href="https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/"><strong>All-reduce</strong></a>
+<p>Detailed explanation of the ring all-reduce algorithm used in distributed training.</p>
+</div>
+
+<div>
+<a href="https://github.com/zhuzilin/ring-flash-attention"><strong>Ring-flash-attention</strong></a>
+<p>Implementation of the ring attention mechanism combined with flash attention for efficient training.</p>
+</div>
+
+<div>
+<a href="https://coconut-mode.com/posts/ring-attention/"><strong>Ring attention tutorial</strong></a>
+<p>Tutorial explaining the concepts and implementation of ring attention.</p>
+</div>
+
+<div>
+<a href="https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/#understanding-performance-tradeoff-between-zero-and-3d-parallelism"><strong>ZeRO and 3D</strong></a>
+<p>DeepSpeed's guide to understanding the tradeoffs between ZeRO and 3D parallelism strategies.</p>
+</div>
+
+<div>
+<a href="https://arxiv.org/abs/1710.03740"><strong>Mixed precision training</strong></a>
+<p>Introduces mixed-precision training techniques for deep learning models.</p>
+</div>
+
 <h3>Hardware</h3>
-
+
+<div>
+<a href="https://www.arxiv.org/abs/2408.14158"><strong>Fire-Flyer - a 10,000 PCIe GPU cluster</strong></a>
+<p>DeepSeek's report on designing a cluster with 10k PCIe GPUs.</p>
+</div>
+
+<div>
+<a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/"><strong>Meta's 24k H100 Pods</strong></a>
+<p>Meta's detailed overview of their massive AI infrastructure built with NVIDIA H100 GPUs.</p>
+</div>
+
+<div>
+<a href="https://www.semianalysis.com/p/100000-h100-clusters-power-network"><strong>Semianalysis - 100k H100 cluster</strong></a>
+<p>Analysis of large-scale H100 GPU clusters and their implications for AI infrastructure.</p>
+</div>
+
 <h3>Others</h3>
+
+<div>
+<a href="https://github.com/stas00/ml-engineering"><strong>Stas Bekman's Handbook</strong></a>
+<p>Comprehensive handbook covering various aspects of training LLMs.</p>
+</div>
+
+<div>
+<a href="https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md"><strong>BLOOM training chronicles</strong></a>
+<p>Detailed documentation of the BLOOM model training process and its challenges.</p>
+</div>
+
+<div>
+<a href="https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf"><strong>OPT logbook</strong></a>
+<p>Meta's detailed logbook documenting the training process of the OPT-175B model.</p>
+</div>
+
+<div>
+<a href="https://www.harmdevries.com/post/model-size-vs-compute-overhead/"><strong>Harm's law for training smol models longer</strong></a>
+<p>Investigation into the relationship between model size and compute overhead.</p>
+</div>
+
+<div>
+<a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog on long context</strong></a>
+<p>Investigation into long-context training in terms of data and training cost.</p>
+</div>

 <h2>Appendix</h2>

src/index.html
CHANGED

@@ -2313,18 +2313,192 @@
(same hunk as dist/index.html above)