DeepSeek-R1-UD-IQ1_M-FP8 : Support and Perf Results on v0.2.3
Hi team!
First, I want to express my gratitude for your incredible work. KTransformers is a game-changer for the LocalLlama community, allowing us to efficiently run top-tier models like DeepSeek-R1 in-house. 🚀
I’d love to share some exciting benchmarking results from my local testing and seek clarification on FP8 performance.
Why IQ1_M?
Memory Constraints
Modern consumer CPUs, like the AMD 9950X and Intel 14900K, support up to 192GB RAM, making IQ1_M the largest viable option before hitting OOM.
Quality Gains
Based on my experience, IQ1_M performs significantly better than IQ1_S. Unsloth's benchmarking data supports this:

Model | Size | Score |
---|---|---|
IQ1_S | 131GB | 6.92 |
IQ1_M | 158GB | 9.08 |
Test Environment
- Hardware: AMD 9950X + Nvidia RTX 4090, 192GB DDR5 4800 (a high-end gaming/workstation setup, no server-grade hardware).
- Software: WSL 2 on Windows 11, running CUDA 12.8 + torch 2.6.0 + flash_attn-2.7.4 + triton 3.2.0, without flashinfer.
- llama.cpp @ d2fe216
- KTransformers @ v0.2.3
Performance Results
Prompt: Create a Flappy Bird game in Python.
Args: --cpu_infer 17 --max_new_tokens 8192
(Benchmark was taken after a warm-up run.)
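For anyone reproducing the numbers, an invocation along these lines should match the args above (a sketch: the model/GGUF paths are placeholders, and `local_chat.py` is the standard ktransformers chat entry point when run from a source checkout):

```bash
# Sketch of the benchmark run from a ktransformers source checkout;
# model/GGUF paths are placeholders, adjust to your setup
python ./ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-R1 \
    --gguf_path /path/to/DeepSeek-R1-UD-IQ1_M/ \
    --cpu_infer 17 \
    --max_new_tokens 8192
```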
Model | TG (tokens/s) |
---|---|
llama.cpp | 2.74 |
KT v0.2.3 | 5.63 |
KT FP8 | 5.51 |
KT dominates prompt eval: a whopping 42 tokens/s versus llama.cpp's 9 tokens/s.
(Tested on prompts of ~1024 tokens (pp1024) generated by another model.)
Discussion
1. FP8 Performance Clarification
The documentation states:
"Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy). So those pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1."
However, "performance" could mean speed (tokens/sec) or quality. My tests show that FP8 is not faster than non-FP8. Is this expected? Could you clarify whether FP8 is meant to improve speed, accuracy, or both?
2. Perf-tuning advice for further improvements
While I am already very happy with the speedup, I would appreciate your input on any potential steps I can try on my side to gather more insights and push it further.
3. Official Support for IQ1_M-FP8
I successfully ran `merge_model` locally without issues, but an official IQ1_M-FP8 release in the repo would be highly beneficial, allowing users to avoid downloading the full R1 model.
Once again, thank you for this fantastic project! I’m sharing these exciting results with my fellow enthusiasts and eagerly looking forward to future updates. 😊
Heya, amazing work pushing the limits of home inferencing! My slightly overclocked 9950X + 96GB DDR5-6200 + 3090TI FE 24GB VRAM + PCIe 5.0 T700 2TB NVMe can hit about 3.5 tok/sec on the R1 671B `UD-Q2_K_XL` 212GiB quant with ktransformers@7a19f3b, while thrashing the page cache pulling over 6GB/s of buffered I/O off the disk. Video of it in action here.
While the verboten 4x DIMM configuration is great for fitting the entire model into RAM, you will likely take a hit on memory bandwidth unless you really won the silicon lottery with the memory controller I/O die.
A few thoughts:
- I also tried `flashinfer`, but didn't notice a performance benefit, and quality seemed degraded, getting stuck in loops. So I prefer triton for now.
- The `fp8` kernels only work on Hopper and newer CUDA architectures (`nvidia-smi --query-gpu=compute_cap --format=csv` >= 9.0); see the quick check right after this list. Unfortunately my 3090TI and RTX A6000 are only 8.6, so they do not support `fp8` natively. I believe it is designed to preserve the original model weights in their original trained quantization for quality reasons, not for speed/performance reasons.
- What version of `llama.cpp` are you using? Some of the ktransformers optimizations are in experimental branches for llama.cpp right now, like selective offloading, MLA, and maybe soon Data Parallel (only interesting for dual socket AMD Epyc `NPS1` or Intel Xeon `SNC=Disable` BIOS configurations). Fortunately the 9950X and ThreadRipper Pro present a single NUMA node.
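If you want to check the `fp8` point on your own box, a minimal sketch just wrapping the `nvidia-smi` query above in a comparison:

```bash
# Native fp8 needs compute capability >= 9.0 (Hopper+);
# a 3090TI / RTX A6000 reports 8.6, a 4090 reports 8.9.
cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
if awk "BEGIN { exit !($cap >= 9.0) }"; then
    echo "compute cap $cap: native fp8 kernels available"
else
    echo "compute cap $cap: no native fp8"
fi
```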
I have written up a guide for deploying ktransformers here, and I hope to add some commands for custom hybrid quantization, given that newer Intel Xeon AMX extensions expose the CPU flags `amx_bf16 amx_tile amx_int8` (you can check for them as shown just below) and there are some int8 quants going around now for supposedly 33% faster inferencing on that target hardware.
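If anyone wants to verify their CPU actually exposes those flags, a quick check with standard Linux tooling (nothing ktransformers-specific):

```bash
# List AMX-related CPU flags; expect amx_bf16, amx_tile, amx_int8 on Sapphire Rapids and newer Xeons
grep -o 'amx[_a-z0-9]*' /proc/cpuinfo | sort -u
# lscpu shows the same flags plus the model name for context
lscpu | grep -iE 'model name|amx'
```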
I have a few unofficial advanced performance tuning tips, which may or may not help you out, over in ktransformers Issue #806, specifically trying `google/tcmalloc` as suggested by the vLLM project.
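For reference, the tcmalloc experiment from that issue is just an `LD_PRELOAD` before launching ktransformers; a rough sketch (package name and `.so` path are the Ubuntu ones and may differ on your distro; the launch line is a placeholder):

```bash
# Install tcmalloc (Ubuntu/Debian package names; adjust for your distro)
sudo apt install -y google-perftools libgoogle-perftools-dev
# Preload it for the ktransformers process
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 \
    python ./ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-R1 \
    --gguf_path /path/to/your/R1-gguf/
```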
Also, have you tried native Linux (instead of WSL 2) with transparent huge pages enabled?
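Checking and flipping THP is just a sysfs toggle on most distros:

```bash
# Show the current THP policy; the active value is in [brackets]
cat /sys/kernel/mm/transparent_hugepage/enabled
# Enable system-wide until reboot (values: always / madvise / never)
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```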
Sorry there is so much spread out all over the place hahah... Hope to keep in touch and share information!
Cheers!
Hey @ubergarm , thanks for sharing your insights!
You know what—I'm also using the T700 2TB, and it's been rock solid. I actually relied heavily on your guide when I first tried KT, and I even caught your video on Reddit the day you posted it! Big shoutout for all your great work—it’s super inspiring for those of us just starting out.
And yeah, the silicon lottery... IKR! Despite having a high-end Aorus motherboard + G.Skill memory modules, I just can't push past 4800.
I gave tcmalloc a shot and the impact feels like it's within the margin of fluctuation. As for running native Linux with THP, I plan to try it once I get the chance.
Quick question: what made you go with `UD-Q2_K_XL`? I'm really curious if you've noticed any significant quality or speed improvements over the IQ variants.
Cheers!
@shawxysu - excellent! exciting to meet a fellow decker rocking a troll rig in the trenches 😆
> Despite having a high-end Aorus motherboard + G.Skill memory modules, I just can't push past 4800.
Yeah, have you raised Vsoc to 1.27v and all that jazz? There are two good threads on it over on level1techs:
- Ryzen 9950X RAM Tuning and Benchmarks
- 192gb DDR5 9950x AMD5 - guy got 4x 48GB stable at 6000MHz
> I gave tcmalloc a shot and the impact feels like it's within the margin of fluctuation
Yeah no noticeable effect really, maybe it loads a bit faster but I didn't actually time it lol...
> Quick question: what made you go with `UD-Q2_K_XL`?
tbh it was the first one I downloaded and I didn't want to download another one haha... I heard a couple of anecdotal reports that for some folks the IQ quants were slower, but that might have been on older CPUs. Also check out this report on perplexity, which suggests `UD-Q2_K_XL` is as much better than `IQ1_M` as `IQ1_M` is over `IQ1_S`, but tbh it isn't clear that perplexity on that specific text for a big MoE means the same thing as in smaller dense models, so take it for what you will. Plus if you are using the `fp8` hybrid, that might be more important than the regular experts anyway.
The latest things I'm keeping an eye on are:
- ik_llama.cpp fork is pushing hard for R1 improvements.
- Reading llama.cpp PR#12227 gave me an education watching them try to stuff the compressed latents into the kv-cache for MLA. Plus the `-ngl=63 -ot=exps` override tensor offload experimental branch gives llama.cpp a big boost, but they seem to be moving a bit more cautiously than ktransformers.
- Otherwise I've discovered some pretty good YT videos covering distributed tiled generalized matrix multiplication (GEMM) algorithms, which need some kind of data parallel, tensor parallel, MoE parallel, and potentially pipeline parallel implementations for improving hybrid CPU/GPU inference.
- Speaking of which, there is a ktransformers feature request open for some of these optimizations (which might help big multi-NUMA Epyc / Xeon servers more than us single NUMA node gamer bois lol).
- One last thing: you could probably roll your own "dynamic" quants by editing this PR by unsloth brother Daniel Han if you want to target specific quants, as suggested by jukofyork in the override tensor offload experimental branch. Especially given you have `fp8` capability it might be neat, but I am still downloading the `fp8` to try quantizing myself eventually haha...
That's all I've got for now! 🫡