Distributed Training
Papers and resources related to distributed training.
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel • Paper • arXiv:2304.11277 • Published Apr 21, 2023
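The FSDP paper covers sharding parameters, gradients, and optimizer state across data-parallel ranks. Below is a minimal sketch of wrapping a toy model with PyTorch's FullyShardedDataParallel; the model, sizes, and torchrun launch are illustrative assumptions, not the paper's setup.

```python
# Minimal FSDP sketch (torch >= 1.12). Launch with torchrun so the
# RANK / WORLD_SIZE / LOCAL_RANK environment variables are set.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a large transformer.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full parameters only around each unit's forward/backward.
    model = FSDP(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```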
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism • Paper • arXiv:1909.08053 • Published Sep 17, 2019
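Megatron-LM splits individual weight matrices across devices (tensor model parallelism). The forward-only sketch below shows the idea of a column-parallel linear layer under the assumption that a process group is already initialized; the class name and shapes are illustrative and not Megatron-LM's API, and the real implementation wraps the collectives in custom autograd functions so gradients flow correctly in backward.

```python
# Illustrative column-parallel linear: each rank holds a column slice of the
# weight, and the output slices are concatenated via all_gather. Assumes
# dist.init_process_group(...) was already called.
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        # Each rank owns out_features // world output columns.
        self.local = nn.Linear(in_features, out_features // world, bias=False)

    def forward(self, x):
        y_local = self.local(x)  # [batch, out_features // world]
        parts = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
        # Note: plain all_gather is not differentiable; Megatron-LM uses custom
        # autograd functions around its collectives. Forward-only here.
        dist.all_gather(parts, y_local)
        return torch.cat(parts, dim=-1)  # full [batch, out_features] activation
```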
Reducing Activation Recomputation in Large Transformer Models • Paper • arXiv:2205.05198 • Published May 10, 2022
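This paper is about recomputing only selected activations (together with sequence parallelism) rather than checkpointing everything. As a baseline illustration of the memory/compute trade-off it improves on, the sketch below applies full activation checkpointing with torch.utils.checkpoint to a made-up residual block.

```python
# Activation (gradient) checkpointing: drop intermediate activations in the
# forward pass and recompute them during backward, trading compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

blocks = nn.ModuleList(Block() for _ in range(4))
x = torch.randn(8, 1024, requires_grad=True)
for blk in blocks:
    # use_reentrant=False is the recommended checkpointing mode in recent PyTorch.
    x = checkpoint(blk, x, use_reentrant=False)
x.sum().backward()
```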
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism • Paper • arXiv:1811.06965 • Published Nov 16, 2018
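GPipe partitions the model into stages and splits each mini-batch into micro-batches so the stages can work concurrently. The toy sketch below runs two stages in a single process purely to show the micro-batching idea; device placement, inter-stage communication, and the actual pipeline schedule are omitted.

```python
# Conceptual GPipe-style micro-batching on one process. In a real pipeline,
# stage0 and stage1 would live on different devices and overlap in time.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(512, 2048), nn.GELU())
stage1 = nn.Sequential(nn.Linear(2048, 512))

batch = torch.randn(32, 512)
micro_batches = batch.chunk(4)  # split the mini-batch into micro-batches

outputs = []
for mb in micro_batches:
    h = stage0(mb)
    outputs.append(stage1(h))
out = torch.cat(outputs)

# Gradients from all micro-batches are accumulated before a single optimizer
# step, keeping the update equivalent to training on the full mini-batch.
out.sum().backward()
```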
michaelbenayoun/llama-2-tiny-4kv-heads-16layers-random • Feature Extraction • Updated Mar 14
michaelbenayoun/mistral-tiny-4layers-8kv-heads-random • Text Generation • Updated Nov 9, 2023
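The two tiny, randomly initialized checkpoints above are convenient for smoke-testing distributed-training code without downloading a full-size model. A hedged loading sketch, assuming the Mistral checkpoint works with transformers' AutoModelForCausalLM and feeding random token ids instead of a tokenizer:

```python
# Quick smoke test with a tiny random checkpoint from the collection above.
import torch
from transformers import AutoModelForCausalLM

model_id = "michaelbenayoun/mistral-tiny-4layers-8kv-heads-random"
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
out = model(input_ids)
print(out.logits.shape)  # [1, 16, vocab_size]
```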