Token generation speed is very slow

#4
by huynguyendbs - opened

@Awni

I saw your post on X mentioning an inference speed of around 28 tokens/s. However, when I ran this model with mlx_lm on my Mac M2 Ultra, I only got about 1 token/s.

I installed mlx_lm with pip install mlx-lm and tested it with:
mlx_lm.generate --model mlx-community/Qwen3-235B-A22B-4bit --max-tokens 4096 --prompt "/no-think How are you?"
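For reference, the equivalent call from the Python API (a minimal sketch following the mlx_lm README, with the model name and prompt copied from the command above; verbose=True prints the tokens/s figure):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-235B-A22B-4bit")

prompt = "/no-think How are you?"
# Apply the chat template if the tokenizer provides one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True reports prompt and generation speed in tokens/s
response = generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True)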

Could you share how you installed and ran mlx_lm to achieve that speed?

MLX Community org

It needs a 192GB machine to work well. How much RAM do you have?
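(One quick way to check from the terminal, if helpful; sysctl reports the total in bytes:)

sysctl hw.memsize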

@Awni
My device is an M2 Ultra with 192GB of RAM.
It seems my GPU isn't running at full capacity; I'm not sure whether I missed some setup or configuration.
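One way to watch live GPU utilization while the model is generating (the exact output format of powermetrics varies by macOS version):

sudo powermetrics --samplers gpu_power -i 1000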

MLX Community org

What OS are you on? For wired memory it's important to be on macOS 15. That can make a big difference for very large models.
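A commonly suggested tweak alongside this (not from this thread specifically) is raising the GPU wired-memory limit so more of the model stays resident. The value below is an illustrative choice for a 192GB machine that leaves headroom for the system; the setting resets on reboot:

sudo sysctl iogpu.wired_limit_mb=180000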

@Awni
It seems like I'm still on macOS 14.5. I will update to macOS 15 and test it again.
Thank you very much.
