Papers - MoE - Training
• Robust Mixture-of-Expert Training for Convolutional Neural Networks (arXiv:2308.10110)
• Experts Weights Averaging: A New General Training Scheme for Vision Transformers (arXiv:2308.06093)
• ConstitutionalExperts: Training a Mixture of Principle-based Prompts (arXiv:2403.04894)
• Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models (arXiv:2403.03432)
• Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models (arXiv:2402.14800)
• Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization (arXiv:2402.12550)
• Buffer Overflow in Mixture of Experts (arXiv:2402.05526)
• MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (arXiv:2211.15841)
• Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (arXiv:1701.06538; a minimal routing sketch follows this list)
• LocMoE: A Low-overhead MoE for Large Language Model Training (arXiv:2401.13920)
• DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (arXiv:2201.05596)
• Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism (arXiv:2304.11414)
• DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (arXiv:2401.06066)
• HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts (arXiv:2312.07035)
• TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
• AMEND: A Mixture of Experts Framework for Long-tailed Trajectory Prediction (arXiv:2402.08698)
• Fast Inference of Mixture-of-Experts Language Models with Offloading (arXiv:2312.17238)
• Sparse Backpropagation for MoE Training (arXiv:2310.00811)
• FedJETs: Efficient Just-In-Time Personalization with Federated Mixture of Experts (arXiv:2306.08586)
• Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts (arXiv:2306.04845)
• Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM (arXiv:2403.07816)
• Unified Scaling Laws for Routed Language Models (arXiv:2202.01169)
• arXiv:2407.10671
• arXiv:2412.09764
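Most entries above build on the sparsely-gated MoE layer introduced in arXiv:1701.06538. Below is a minimal sketch of that top-k routing scheme, assuming PyTorch; the SparseMoE class name and all sizes are illustrative, and the sketch omits the original paper's noisy gating and load-balancing loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sparsely-gated MoE layer: each token is sent to its top-k experts."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])             # (n_tokens, d_model)
        logits = self.gate(tokens)                      # (n_tokens, n_experts)
        top_val, top_idx = logits.topk(self.k, dim=-1)  # keep k experts per token
        weights = F.softmax(top_val, dim=-1)            # renormalize over the top-k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Which (token, slot) pairs routed to expert e?
            token_ids, slot_ids = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                # expert got no tokens
            expert_out = expert(tokens[token_ids])
            out.index_add_(0, token_ids,
                           weights[token_ids, slot_ids].unsqueeze(-1) * expert_out)
        return out.reshape_as(x)

# Toy usage: 4 sequences of 16 tokens, width 64, 8 experts, top-2 routing.
moe = SparseMoE(d_model=64, d_hidden=256, n_experts=8, k=2)
y = moe(torch.randn(4, 16, 64))                         # same shape as the input

Production systems replace this per-expert Python loop with batched dispatch (for example, the block-sparse kernels of MegaBlocks, or the parallelism schemes in DeepSpeed-MoE and Pipeline MoE above) and add an auxiliary loss so tokens spread evenly across experts.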