Inference much slower compared to other A3B models
I have tested the model's speed in llama.cpp on my hardware, and it turns out to be much slower than other A3B models. Both pp and tg also drop off very quickly, much more so than for other similarly sized MoE models. I want to check whether you have any benchmarks from your internal testing showing how the model fares against other similar-sized models for prompt processing and decoding.
I want to understand whether this is a llama.cpp problem, a Vulkan back-end problem, or whether the model is simply like this due to its internal architecture.
llama-bench build: 8f91ca54e (7822)
FA = off
| Test (t/s) | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 147.08 ± 1.49 | 113.02 ± 0.32 | 119.46 ± 0.18 | 102.81 ± 0.31 |
| tg128 | 16.17 ± 0.00 | 12.05 ± 0.01 | 12.95 ± 0.01 | 10.77 ± 0.00 |
| pp512 @ d1024 | 136.19 ± 1.73 | 111.24 ± 0.13 | 105.93 ± 0.34 | 86.65 ± 0.31 |
| tg128 @ d1024 | 15.78 ± 0.03 | 11.84 ± 0.01 | 12.06 ± 0.06 | 7.29 ± 0.05 |
| pp512 @ d2048 | 128.45 ± 1.21 | 108.86 ± 0.40 | 94.63 ± 0.51 | 73.20 ± 0.48 |
| tg128 @ d2048 | 15.20 ± 0.03 | 11.50 ± 0.00 | 11.23 ± 0.00 | 5.28 ± 0.03 |
| pp512 @ d8096 | 95.64 ± 0.76 | 98.47 ± 0.93 | 56.28 ± 0.18 | 38.71 ± 0.02 |
| tg128 @ d8096 | 12.28 ± 0.01 | 9.17 ± 0.05 | 5.89 ± 0.05 | 2.19 ± 0.02 |
FA = on
| Test (t/s) | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 146.69 ± 0.87 | 112.54 ± 0.65 | 114.26 ± 0.87 | 86.09 ± 0.12 |
| tg128 | 16.64 ± 0.01 | 12.12 ± 0.01 | 13.39 ± 0.01 | 10.97 ± 0.01 |
| pp512 @ d1024 | 132.76 ± 0.39 | 107.09 ± 0.32 | 77.43 ± 0.10 | 50.39 ± 0.10 |
| tg128 @ d1024 | 16.36 ± 0.08 | 12.05 ± 0.01 | 12.29 ± 0.00 | 9.76 ± 0.01 |
| pp512 @ d2048 | 120.38 ± 0.10 | 101.26 ± 0.28 | 55.47 ± 0.35 | 35.40 ± 0.02 |
| tg128 @ d2048 | 16.11 ± 0.08 | 11.98 ± 0.00 | 11.66 ± 0.01 | 8.79 ± 0.00 |
| pp512 @ d8096 | 77.32 ± 0.34 | 77.85 ± 0.48 | 20.76 ± 0.17 | 12.94 ± 0.01 |
| tg128 @ d8096 | 14.91 ± 0.01 | 11.52 ± 0.00 | 8.92 ± 0.00 | 5.58 ± 0.00 |
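For reference, a sweep along these lines should reproduce the layout of the tables above (a rough sketch only: `llama-bench` is assumed to be on PATH, the model file names are placeholders, and `-p`/`-n`/`-d`/`-fa` correspond to the pp512 / tg128 / `@ dN` / FA settings shown):

```python
# Sketch: sweep the llama-bench settings used in the tables above.
# Assumes llama-bench is on PATH; model file names are placeholders.
import subprocess

models = {
    "gpt-oss 20B MXFP4": "gpt-oss-20b-mxfp4.gguf",
    "qwen3moe 30B.A3B Q8_0": "qwen3-30b-a3b-q8_0.gguf",
}

for fa in ("0", "1"):                      # FA = off / on
    for name, path in models.items():
        cmd = [
            "llama-bench",
            "-m", path,
            "-p", "512",                   # pp512
            "-n", "128",                   # tg128
            "-d", "0,1024,2048,8096",      # the "@ dN" context depths
            "-fa", fa,                     # flash attention off/on
        ]
        print(f"### {name} (FA={'on' if fa == '1' else 'off'})")
        subprocess.run(cmd, check=True)
```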
I suppose you should compare the speed with vLLM / SGLang, since the llama.cpp support was not added by z.ai.
I do not have a machine to do that, which is why I am asking for results from their internal testing. That would give a reference point for whether this is due to the inference engine or the model itself. The Qwen team published benchmarks for Qwen3 Next against other similarly sized models.
So far, GPT OSS 20B is king at lower contexts, and NVIDIA-Nemotron-3-Nano-30B-A3B is the best at retaining PP and TG thanks to its Mamba-2 architecture.
Yeah, it's really rough in my experience. I'm praying that the implementations in all the backends improve - this model is potent.
In my (very limited) experience, Vulkan is a joke and you should never even think about using it. If your GPU only supports Vulkan due to age or something, then I think that is the issue. You should not be using Vulkan as a benchmark for anything, so even if it's notably slower on this model versus others in the same MoE size class, it's just not the kind of backend you want to measure performance against.
I get it if it's all you have, but Vulkan is just so brutal. I have an AMD R9700, and Vulkan was what I first tried when learning llama.cpp/vllm; I was ready to return the card until I went the native ROCm route for it.