Inference much slower compared to other A3B models
I have tested the model's speed in llama.cpp on my hardware, and it turns out to be much slower than other A3B models. Both pp and tg also drop off very quickly, much more so than for other similarly sized MoE models. I want to check whether you have any benchmarks from your internal testing showing how the model fares against other similar-sized models for prompt processing and decoding.
I want to understand whether this is a llama.cpp problem, a Vulkan back-end problem, or whether the model is simply like this due to its internal architecture.
llama-bench build: 8f91ca54e (7822)
FA = off
| Test (t/s) | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 147.08 ± 1.49 | 113.02 ± 0.32 | 119.46 ± 0.18 | 102.81 ± 0.31 |
| tg128 | 16.17 ± 0.00 | 12.05 ± 0.01 | 12.95 ± 0.01 | 10.77 ± 0.00 |
| pp512 @ d1024 | 136.19 ± 1.73 | 111.24 ± 0.13 | 105.93 ± 0.34 | 86.65 ± 0.31 |
| tg128 @ d1024 | 15.78 ± 0.03 | 11.84 ± 0.01 | 12.06 ± 0.06 | 7.29 ± 0.05 |
| pp512 @ d2048 | 128.45 ± 1.21 | 108.86 ± 0.40 | 94.63 ± 0.51 | 73.20 ± 0.48 |
| tg128 @ d2048 | 15.20 ± 0.03 | 11.50 ± 0.00 | 11.23 ± 0.00 | 5.28 ± 0.03 |
| pp512 @ d8096 | 95.64 ± 0.76 | 98.47 ± 0.93 | 56.28 ± 0.18 | 38.71 ± 0.02 |
| tg128 @ d8096 | 12.28 ± 0.01 | 9.17 ± 0.05 | 5.89 ± 0.05 | 2.19 ± 0.02 |
FA = on
| Test (t/s) | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 146.69 ± 0.87 | 112.54 ± 0.65 | 114.26 ± 0.87 | 86.09 ± 0.12 |
| tg128 | 16.64 ± 0.01 | 12.12 ± 0.01 | 13.39 ± 0.01 | 10.97 ± 0.01 |
| pp512 @ d1024 | 132.76 ± 0.39 | 107.09 ± 0.32 | 77.43 ± 0.10 | 50.39 ± 0.10 |
| tg128 @ d1024 | 16.36 ± 0.08 | 12.05 ± 0.01 | 12.29 ± 0.00 | 9.76 ± 0.01 |
| pp512 @ d2048 | 120.38 ± 0.10 | 101.26 ± 0.28 | 55.47 ± 0.35 | 35.40 ± 0.02 |
| tg128 @ d2048 | 16.11 ± 0.08 | 11.98 ± 0.00 | 11.66 ± 0.01 | 8.79 ± 0.00 |
| pp512 @ d8096 | 77.32 ± 0.34 | 77.85 ± 0.48 | 20.76 ± 0.17 | 12.94 ± 0.01 |
| tg128 @ d8096 | 14.91 ± 0.01 | 11.52 ± 0.00 | 8.92 ± 0.00 | 5.58 ± 0.00 |
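For reference, a sweep along these lines should reproduce the layout of the tables above (a rough sketch only: `llama-bench` is assumed to be on PATH, the model file names are placeholders, and `-p`/`-n`/`-d`/`-fa` correspond to the pp512 / tg128 / `@ dN` / FA settings shown):

```python
# Sketch: sweep the llama-bench settings used in the tables above.
# Assumes llama-bench is on PATH; model file names are placeholders.
import subprocess

models = {
    "gpt-oss 20B MXFP4": "gpt-oss-20b-mxfp4.gguf",
    "qwen3moe 30B.A3B Q8_0": "qwen3-30b-a3b-q8_0.gguf",
}

for fa in ("0", "1"):                      # FA = off / on
    for name, path in models.items():
        cmd = [
            "llama-bench",
            "-m", path,
            "-p", "512",                   # pp512
            "-n", "128",                   # tg128
            "-d", "0,1024,2048,8096",      # the "@ dN" context depths
            "-fa", fa,                     # flash attention off/on
        ]
        print(f"### {name} (FA={'on' if fa == '1' else 'off'})")
        subprocess.run(cmd, check=True)
```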
I suppose you should compare the speed with vLLM / SGLang, since the llama.cpp support was not added by z.ai.
I do not have a machine to do that, which is why I am asking for results from their internal testing. That would give a reference point for whether this is due to the inference engine or the model itself. The Qwen team published benchmarks for Qwen3 Next against other similarly sized models.
So far, GPT OSS 20B is king at lower contexts, and NVIDIA-Nemotron-3-Nano-30B-A3B is the best at retaining PP and TG thanks to its Mamba-2 architecture.
Yeah, it's really rough in my experience. I'm praying that the implementations in all the backends improve - this model is potent.
In my (very limited) experience, Vulkan is a joke and you should never even think about using it. If your GPU only supports Vulkan due to age or something, then I think that is the issue. You should not be using Vulkan as a benchmark for anything, so even if it's notably slower on this model versus others in the same MoE size class, it's just not the kind of backend you want to measure performance against.
I get it if it's all you have, but Vulkan is just so brutal. I have an AMD R9700, and Vulkan was what I first tried when learning llama.cpp/vllm; I was ready to return the card until I went the native ROCm route for it.