Token generation speed is very slow
I saw your post on X mentioning an inference speed of ~28 tokens/s. However, when I ran this model with mlx_lm on my Mac M2 Ultra, I only got about 1 token/s.
I installed mlx_lm with:

```
pip install mlx-lm
```

and tested it with:

```
mlx_lm.generate --model mlx-community/Qwen3-235B-A22B-4bit --max-tokens 4096 --prompt "/no-think How are you?"
```
Could you share how you installed and ran mlx_lm to achieve that speed?
It needs a 192GB machine to work well. How much RAM do you have?
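(If you're unsure, macOS's built-in sysctl reports the installed RAM:)

```
# Total installed RAM in bytes (divide by 1024^3 for GB).
sysctl -n hw.memsize
```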
@Awni
My device is an M2 Ultra with 192GB of RAM.
It seems like my GPU isn't running at full capacity. I'm not sure if I missed some setup or configuration.
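In case it's useful, one way to watch GPU utilization on macOS while generation is running is the built-in powermetrics tool (run it in a second terminal; it needs sudo):

```
# Sample Apple-silicon GPU activity once per second.
sudo powermetrics --samplers gpu_power -i 1000
```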
What OS are you on? For wired memory it's important to be on macOS 15. That can make a big difference for very large models.
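You can check your version with sw_vers. On recent macOS you can also raise the GPU wired-memory limit; this iogpu sysctl is the knob commonly suggested for very large MLX models, and the ~180 GB value below is just an illustrative choice for a 192 GB machine:

```
# Print the macOS version.
sw_vers -productVersion

# Allow the GPU to wire up to ~180 GB (value is in MB; resets on reboot).
sudo sysctl iogpu.wired_limit_mb=184320
```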
@Awni
It seems like I'm still on macOS 14.5. I will update to macOS 15 and test it again.
Thank you very much.