Token generation speed is very slow

#4
by huynguyendbs - opened

@Awni

I saw your post on X mentioning an inference speed of around 28 tokens/s. However, when I ran this model with mlx_lm on my Mac M2 Ultra, I only got about 1 token/s.

I installed mlx_lm with pip install mlx-lm and tested it with:
mlx_lm.generate --model mlx-community/Qwen3-235B-A22B-4bit --max-tokens 4096 --prompt "/no-think How are you?"
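For reference, the equivalent call from the Python API (a minimal sketch following the mlx_lm README, with the model name and prompt copied from the command above; verbose=True prints the tokens/s figure):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-235B-A22B-4bit")

prompt = "/no-think How are you?"
# Apply the chat template if the tokenizer provides one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True reports prompt and generation speed in tokens/s
response = generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True)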

Could you share how you installed and ran mlx_lm to achieve that speed?

MLX Community org

It needs a 192GB machine to work well. How much RAM do you have?
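(One quick way to check from the terminal, if helpful; sysctl reports the total in bytes:)

sysctl hw.memsize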

@Awni
My device is an M2 Ultra with 192GB of RAM.
It seems my GPU isn't running at full capacity; I'm not sure whether I missed some setup or configuration.
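One way to watch live GPU utilization while the model is generating (the exact output format of powermetrics varies by macOS version):

sudo powermetrics --samplers gpu_power -i 1000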

MLX Community org

What OS are you on? For wired memory it's important to be on macOS 15. That can make a big difference for very large models.
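A commonly suggested tweak alongside this (not from this thread specifically) is raising the GPU wired-memory limit so more of the model stays resident. The value below is an illustrative choice for a 192GB machine that leaves headroom for the system; the setting resets on reboot:

sudo sysctl iogpu.wired_limit_mb=180000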

@Awni
It seems like I'm still on macOS 14.5. I will update to macOS 15 and test it again.
Thank you very much.
