PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Abstract
The emergence of DeepSeek R1 and QwQ 32B has broken the performance barrier for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing on-device solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPUs and GPUs, low RAM/VRAM, Wi-Fi, and cross-platform operating systems. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. We propose Halda, an elegant algorithm that solves this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
Community
prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices—💻 laptops, 🖥️ desktops, 📱 phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!
Worried about OOM or your device freezing? Never again! prima.cpp keeps its memory pressure below 10%, so you can run very large models while scrolling TikTok (if you don't mind the inference speed).
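That low memory pressure comes from mmap-backed weights: the OS pages tensor data in on demand and can evict it when memory gets tight, so the resident set stays far below the model size. Below is a minimal POSIX sketch of the idea; `map_weights` and the struct are illustrative assumptions, not prima.cpp's actual API.

```cpp
// Minimal sketch of mmap-backed weight loading (POSIX). The names here are
// illustrative, not prima.cpp's actual API. Mapping the file lets the kernel
// page tensor data in on demand and evict it under memory pressure, instead
// of copying the whole model into RAM up front.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

struct MappedWeights {
    void*  data = nullptr;
    size_t size = 0;
};

MappedWeights map_weights(const char* path) {
    MappedWeights w;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return w;

    struct stat st;
    if (fstat(fd, &st) == 0) {
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            w.data = p;
            w.size = static_cast<size_t>(st.st_size);
            // Layer weights are read mostly in order; hint the kernel.
            posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);
        }
    }
    close(fd);  // the mapping stays valid after the fd is closed
    return w;
}
```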
How about speed? Built upon llama.cpp, it's up to 15x faster! 🚀 Even on low-end devices, QwQ-32B generates 11 tokens per second and Llama 3-70B generates 1.5 tokens per second, about the range of audiobook apps, from slow to fast narration. Then you can have fully local chats without privacy concerns.
If your devices are more powerful, you could unlock even more possibilities, like running LLM agents right in your home!
Key features:
- Heterogeneous, low-resource, cross-platform clusters (e.g., home devices connected by Wi-Fi)
- Quantization (Q4_K and IQ1)
- Mixed CPU/GPU computing
- Disk offloading
- Piped-ring parallelism with prefetching (see the sketch after this list)
- Automatic workload distribution
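To give a feel for the prefetching side of piped-ring parallelism, here is a hedged sketch, assuming mmap'd weights as above: while the current layer computes, a madvise hint asks the kernel to start paging in the next layer's segment, so disk loading overlaps with compute instead of blocking it. The `LayerSegment` type and function names are hypothetical, not prima.cpp's code.

```cpp
// Hedged sketch of prefetching within one device's slice of the ring.
// LayerSegment is a hypothetical handle to one layer's slice of the
// mmap'd weight file; the real prima.cpp data structures differ.
#include <sys/mman.h>
#include <cstddef>
#include <vector>

struct LayerSegment {
    void*  addr;  // start of this layer's weights inside the mapping
    size_t size;  // bytes occupied by this layer
};

void prefetch(const LayerSegment& seg) {
    // Non-blocking hint: the kernel begins readahead while we keep computing.
    posix_madvise(seg.addr, seg.size, POSIX_MADV_WILLNEED);
}

void run_ring_round(const std::vector<LayerSegment>& my_layers) {
    for (size_t i = 0; i < my_layers.size(); ++i) {
        if (i + 1 < my_layers.size()) {
            prefetch(my_layers[i + 1]);  // hide the next layer's disk load
        }
        // compute_layer(my_layers[i]);  // placeholder: run this layer's GEMMs,
        // then pass activations to the next device in the ring and receive
        // the next round's activations.
    }
}
```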
Wait, are you telling me I can run a 70B model with a reasonable speed on my 16GB of RAM and 8GB VRAM? In Vulkan too? Where? How? Me wants! 🤗😳