ubergarm committed
Commit 8e1f011 · 1 Parent(s): cb5a12f

Add note about 24-48GB VRAM or CPU only optimized quants

Files changed (1)
README.md +1 -1
README.md CHANGED
@@ -17,7 +17,7 @@ This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/i
 
  These quants provide best in class perplexity for the given memory footprint. MLA support allows 32k+ context length in under 24GB GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.
 
- Perfect for CPU+GPU systems with 24GB+ VRAM, and also CPU *only* rigs using dynamic quant repacking (for maximum memory throughput).
+ These quants are specifically designed for CPU+GPU systems with 24-48GB VRAM, and also CPU *only* rigs using dynamic quant repacking (for maximum memory throughput). If you have more VRAM, I suggest a different quant with at least some routed expert layers optimized for GPU offload.
 
  You could try `ik_llama.cpp` quickly with your *existing* quants, as it computes MLA tensors and repacks quants on the fly at startup (if you have enough RAM+VRAM to fit entire model). Then come check out these fat quants here once you see the difference.
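
Below is a hedged launch sketch of the setup the note above describes: attention and shared weights on a roughly 24GB GPU, routed MoE expert tensors left in system RAM, with run-time repacking for CPU throughput. The model path, layer count, thread count, and host/port are placeholders, and the flag spellings (`-mla`, `-fa`, `-fmoe`, `-rtr`, `--override-tensor`) reflect my understanding of the ik_llama.cpp CLI rather than this repo's README; confirm against `llama-server --help` in your own build.

```bash
# Hedged sketch (not from this repo's README): serve a DeepSeek R1/V3 quant
# with ik_llama.cpp, offloading all layers to the GPU except the routed
# expert tensors, which stay in system RAM.
# Placeholders: model path, layer count, thread count, host/port.
./build/bin/llama-server \
    --model /path/to/your-deepseek-quant.gguf \
    --ctx-size 32768 \
    -mla 2 -fa \
    -fmoe \
    -rtr \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080
```

For a CPU-only rig the same idea applies without the GPU offload flags: keeping `-rtr` lets the quants be repacked at startup for better memory throughput, which is the "dynamic quant repacking" the note refers to.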