HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Abstract
The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it increases model capacity without a proportional increase in computation. However, the large size of MoE models still imposes substantial memory demands, which usually require expert offloading on resource-constrained platforms and incur significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation and reduce expert loading overhead, but it faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, hybrid CPU-GPU scheduling for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and other factors. To address these challenges, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to state-of-the-art hybrid MoE inference frameworks. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
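To make the score-based caching idea concrete, below is a minimal illustrative sketch in the spirit of the abstract's description: each expert accumulates a decayed activation score from router outputs, and the lowest-scoring resident expert is evicted when GPU cache capacity is exceeded. Class and method names (e.g. `ScoreBasedExpertCache`, `ensure_resident`) are hypothetical and do not correspond to HybriMoE's actual API; the details of the paper's policy may differ.

```python
from collections import defaultdict

class ScoreBasedExpertCache:
    """Illustrative score-based expert cache (assumption-based sketch, not HybriMoE's code)."""

    def __init__(self, capacity: int, decay: float = 0.9):
        self.capacity = capacity           # max number of experts resident on the GPU
        self.decay = decay                 # decay factor applied to historical scores
        self.scores = defaultdict(float)   # (layer, expert_id) -> activation score
        self.resident = set()              # experts currently cached on the GPU

    def update_scores(self, layer: int, router_probs) -> None:
        """Decay old scores and add this step's routing probabilities for one layer."""
        for key in list(self.scores):
            self.scores[key] *= self.decay
        for expert_id, prob in enumerate(router_probs):
            self.scores[(layer, expert_id)] += float(prob)

    def ensure_resident(self, layer: int, expert_id: int, load_fn, evict_fn) -> None:
        """Load an expert onto the GPU, evicting the lowest-score resident expert if full."""
        key = (layer, expert_id)
        if key in self.resident:
            return
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda k: self.scores[k])
            evict_fn(victim)               # user-provided callback to free GPU memory
            self.resident.remove(victim)
        load_fn(key)                       # user-provided callback to copy weights to GPU
        self.resident.add(key)
```

A caller would invoke `update_scores` after each router step and `ensure_resident` before dispatching a token to a given expert; the actual framework additionally coordinates this cache with intra-layer CPU-GPU scheduling and inter-layer prefetching.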
Community
HybriMoE: a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system, achieving an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to state-of-the-art hybrid MoE inference frameworks.
Code is available at https://github.com/PKU-SEC-Lab/HybriMoE.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- Accurate Expert Predictions in MoE Inference via Cross-Layer Gate (2025)
- PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices (2025)
- SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading (2025)
- MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching (2025)
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing (2025)
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference (2025)
- FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework (2025)