Add note about 24-48GB VRAM or CPU-only optimized quants
README.md CHANGED

@@ -17,7 +17,7 @@ This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/i
 
 These quants provide best-in-class perplexity for the given memory footprint. MLA support allows 32k+ context length in under 24GB of GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.
 
-
+These quants are specifically designed for CPU+GPU systems with 24-48GB VRAM, and also for CPU-*only* rigs using dynamic quant repacking (for maximum memory throughput). If you have more VRAM, I suggest a different quant with at least some routed expert layers optimized for GPU offload.
 
 You could try `ik_llama.cpp` quickly with your *existing* quants, as it computes MLA tensors and repacks quants on the fly at startup (if you have enough RAM+VRAM to fit the entire model). Then come check out these fat quants here once you see the difference.
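
As a rough sketch of what the added note describes, a hybrid CPU+GPU launch with `ik_llama.cpp` might look like the following. The model path, `--threads`, and `-ngl` values are placeholders, and the flag spellings (`-mla`, `-fa`, `-fmoe`, `-rtr`, `-ot`) reflect my reading of ik_llama.cpp's CLI, so verify them against `llama-server --help` on your build:

```bash
# Hypothetical launch: keep attention/shared weights on a ~24GB GPU and pin the
# routed MoE experts to system RAM. Paths and numeric values are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-GGUF/DeepSeek-R1-Q2_K-00001-of-00005.gguf \
    --ctx-size 32768 \
    -mla 2 -fa \
    -fmoe \
    -rtr \
    -ngl 63 \
    -ot exps=CPU \
    --threads 16
# -mla 2 -fa   : MLA attention + flash attention, which let 32k+ context fit in limited VRAM
# -fmoe        : fused MoE kernels
# -rtr         : run-time repack of an existing quant for better memory throughput (disables mmap)
# -ngl 63      : offload all layers to the GPU ...
# -ot exps=CPU : ... then override the routed expert tensors back onto CPU/RAM
```

For a CPU-only rig the same idea applies without the GPU flags: drop `-ngl` and `-ot` and let `-rtr` repack the quants at startup.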