PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Abstract
The emergence of DeepSeek R1 and QwQ 32B has broken the performance barrier for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing on-device solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPUs and GPUs, low RAM/VRAM, Wi-Fi, and cross-platform operating systems. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. We propose Halda, an elegant algorithm that solves this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
Community
prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices—💻 laptops, 🖥️ desktops, 📱 phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!
Worried about OOM or your device freezing? Never again! prima.cpp keeps its memory pressure below 10%, so you can run very large models while scrolling TikTok (if you don't mind the inference speed).
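That low memory pressure comes from mmap-backed weights: the OS pages tensor data in on demand and can evict it when memory gets tight, so the resident set stays far below the model size. Below is a minimal POSIX sketch of the idea; `map_weights` and the struct are illustrative assumptions, not prima.cpp's actual API.

```cpp
// Minimal sketch of mmap-backed weight loading (POSIX). The names here are
// illustrative, not prima.cpp's actual API. Mapping the file lets the kernel
// page tensor data in on demand and evict it under memory pressure, instead
// of copying the whole model into RAM up front.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

struct MappedWeights {
    void*  data = nullptr;
    size_t size = 0;
};

MappedWeights map_weights(const char* path) {
    MappedWeights w;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return w;

    struct stat st;
    if (fstat(fd, &st) == 0) {
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            w.data = p;
            w.size = static_cast<size_t>(st.st_size);
            // Layer weights are read mostly in order; hint the kernel.
            posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);
        }
    }
    close(fd);  // the mapping stays valid after the fd is closed
    return w;
}
```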
How about speed? Built upon llama.cpp, it's up to 15x faster! 🚀 Even on low-end devices, QwQ-32B generates 11 tokens per second and Llama 3-70B generates 1.5 tokens per second, about the range of audiobook apps, from slow to fast narration. Then you can have fully local chats without privacy concerns.
If your devices are more powerful, you could unlock even more possibilities, like running LLM agents right in your home!
Key features:
- Heterogeneous, low-resource, cross-platform clusters (e.g., home devices connected by Wi-Fi)
- Quantization (Q4_K and IQ1)
- Mixed CPU/GPU computing
- Disk offloading
- Piped-ring parallelism with prefetching (see the sketch after this list)
- Automatic workload distribution
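To give a feel for the prefetching side of piped-ring parallelism, here is a hedged sketch, assuming mmap'd weights as above: while the current layer computes, a madvise hint asks the kernel to start paging in the next layer's segment, so disk loading overlaps with compute instead of blocking it. The `LayerSegment` type and function names are hypothetical, not prima.cpp's code.

```cpp
// Hedged sketch of prefetching within one device's slice of the ring.
// LayerSegment is a hypothetical handle to one layer's slice of the
// mmap'd weight file; the real prima.cpp data structures differ.
#include <sys/mman.h>
#include <cstddef>
#include <vector>

struct LayerSegment {
    void*  addr;  // start of this layer's weights inside the mapping
    size_t size;  // bytes occupied by this layer
};

void prefetch(const LayerSegment& seg) {
    // Non-blocking hint: the kernel begins readahead while we keep computing.
    posix_madvise(seg.addr, seg.size, POSIX_MADV_WILLNEED);
}

void run_ring_round(const std::vector<LayerSegment>& my_layers) {
    for (size_t i = 0; i < my_layers.size(); ++i) {
        if (i + 1 < my_layers.size()) {
            prefetch(my_layers[i + 1]);  // hide the next layer's disk load
        }
        // compute_layer(my_layers[i]);  // placeholder: run this layer's GEMMs,
        // then pass activations to the next device in the ring and receive
        // the next round's activations.
    }
}
```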
Wait, are you telling me I can run a 70B model with a reasonable speed on my 16GB of RAM and 8GB VRAM? In Vulkan too? Where? How? Me wants! 🤗😳