72GB VRAM Users - What size quants do you want (70B models) ?

by FrenzyBiscuit - opened
Ready.Art org

This question is specifically for people with 72GB VRAM.

What size quants do you want for 70B models?

Currently, I'm thinking 5.35 bpw and 6.70 bpw, but I'm not sure about the rest.
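As a sanity check on those figures, here's the usual back-of-the-envelope weight footprint for a 70B at a given bpw (a sketch only; real quants carry some extra overhead for scales and the embedding/head layers):

```python
# Rough weight footprint of a 70B model at various bits-per-weight.
# Quantization overhead is assumed to be folded into the bpw figure.

PARAMS = 70e9  # parameter count (70B)

for bpw in (4.0, 5.0, 5.35, 6.0, 6.70, 8.0):
    gib = PARAMS * bpw / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{bpw:.2f} bpw -> {gib:5.1f} GiB of weights")
```

At 5.35 bpw that's roughly 44 GiB and at 6.70 bpw roughly 55 GiB, so both leave headroom for KV cache on 72GB setups.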

FrenzyBiscuit changed discussion status to closed

How do you get that amount of VRAM?

3x 3090/4090/5090, or the 96GB RTX PRO 6000.

I see, thanks. And does 3x3090 deliver a reasonable tokens/s for a 70B model?

Ready.Art org

> I see, thanks. And does 3x3090 deliver a reasonable tokens/s for a 70B model?

I don't have hard numbers on hand, but with a 5.0 bpw exl3 quant, 3x3090s power-limited to 200W maintain around 13-15 t/s on a roleplay with 60k tokens of FP16 context.

This is with tensor parallelism enabled.

However, this is without prompt reprocessing (no lorebook or rag).

Prompt reprocessing is not fast at all on exl3 because it's not optimized for Ampere.

Even on exl2, 3x3090 tends to be slow at prompt reprocessing with 60k context on 70B models.

If that bothers you, I'd advise 3x4090 or 3x5090.
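For a sense of why a 60k FP16 cache still fits alongside a ~5 bpw 70B in 72GB, here's rough arithmetic assuming Llama-3-70B-style dimensions (80 layers, 8 GQA KV heads, head dim 128; adjust for other architectures):

```python
# Rough FP16 KV-cache size for a 60k context on a Llama-3-70B-style model.

layers, kv_heads, head_dim = 80, 8, 128
context = 60_000
bytes_per_elem = 2  # FP16

# K and V each store layers * kv_heads * head_dim elements per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = per_token * context / 1024**3

print(f"{per_token / 1024:.0f} KiB per token, {total_gib:.1f} GiB at {context} tokens")
```

That works out to about 18 GiB of cache, which together with ~44 GiB of 5.0 bpw weights sits just under the 67 GiB usable on 3x24GB cards.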

FrenzyBiscuit changed discussion status to open

Great, thanks for all the details

FrenzyBiscuit changed discussion status to closed

(128GB) Q4s are generally fine, though above ~30B I much prefer MoE models; it makes a much bigger difference for speed.
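Rough illustration of that speed point, using Mixtral-8x22B-style sizes (141B total / ~39B active) purely as an example:

```python
# Why MoE favors speed: per-token compute/bandwidth scales with *active*
# params, while the memory footprint scales with *total* params.
# Figures below are illustrative, not benchmarks.

def weights_gib(params: float, bpw: float) -> float:
    return params * bpw / 8 / 1024**3

dense_total = 70e9                    # dense 70B: every param read per token
moe_total, moe_active = 141e9, 39e9   # Mixtral-8x22B-style MoE

print(f"dense ~Q4 footprint: {weights_gib(dense_total, 4.5):.0f} GiB, "
      f"params read per token: {dense_total / 1e9:.0f}B")
print(f"MoE   ~Q4 footprint: {weights_gib(moe_total, 4.5):.0f} GiB, "
      f"params read per token: {moe_active / 1e9:.0f}B")
```

The MoE needs more VRAM overall (fine with 128GB), but each generated token only touches the active experts, so generation runs much faster than a dense model of comparable total size.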
