72GB VRAM Users - What size quants do you want (70B models) ?

by FrenzyBiscuit - opened
Ready.Art org

This question is specifically for people with 72GB VRAM.

What size quants do you want for 70B models?

Currently, I'm thinking 5.35 bpw and 6.70 bpw, but I'm not sure about the rest.
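As a sanity check on those figures, here's the usual back-of-the-envelope weight footprint for a 70B at a given bpw (a sketch only; real quants carry some extra overhead for scales and the embedding/head layers):

```python
# Rough weight footprint of a 70B model at various bits-per-weight.
# Quantization overhead is assumed to be folded into the bpw figure.

PARAMS = 70e9  # parameter count (70B)

for bpw in (4.0, 5.0, 5.35, 6.0, 6.70, 8.0):
    gib = PARAMS * bpw / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{bpw:.2f} bpw -> {gib:5.1f} GiB of weights")
```

At 5.35 bpw that's roughly 44 GiB and at 6.70 bpw roughly 55 GiB, so both leave headroom for KV cache on 72GB setups.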

FrenzyBiscuit changed discussion status to closed

How do you get that amount of VRAM?

3x 3090/4090/5090, or the 96GB RTX PRO 6000.

I see, thanks. And does 3x3090 deliver a reasonable tokens/s for a 70B model?

Ready.Art org

> I see, thanks. And does 3x3090 deliver a reasonable tokens/s for a 70B model?

I don't have hard numbers on hand, but with a 5.0 bpw exl3 quant, 3x3090s power-limited to 200W maintain around 13-15 t/s on a roleplay with 60k tokens of FP16 context.

This is with tensor parallelism enabled.

However, this is without prompt reprocessing (no lorebook or rag).

Prompt reprocessing is not fast at all on exl3 because it's not optimized for Ampere.

Even on exl2, 3x3090 tends to be slow at prompt reprocessing with 60k context on 70B models.

If that bothers you, I'd advise 3x4090 or 3x5090.
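For a sense of why a 60k FP16 cache still fits alongside a ~5 bpw 70B in 72GB, here's rough arithmetic assuming Llama-3-70B-style dimensions (80 layers, 8 GQA KV heads, head dim 128; adjust for other architectures):

```python
# Rough FP16 KV-cache size for a 60k context on a Llama-3-70B-style model.

layers, kv_heads, head_dim = 80, 8, 128
context = 60_000
bytes_per_elem = 2  # FP16

# K and V each store layers * kv_heads * head_dim elements per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = per_token * context / 1024**3

print(f"{per_token / 1024:.0f} KiB per token, {total_gib:.1f} GiB at {context} tokens")
```

That works out to about 18 GiB of cache, which together with ~44 GiB of 5.0 bpw weights sits just under the 67 GiB usable on 3x24GB cards.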

FrenzyBiscuit changed discussion status to open

Great, thanks for all the details

FrenzyBiscuit changed discussion status to closed

(128GB) Q4s are generally fine, though above ~30B I much prefer MoE models; it makes a much bigger difference for speed.
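Rough illustration of that speed point, using Mixtral-8x22B-style sizes (141B total / ~39B active) purely as an example:

```python
# Why MoE favors speed: per-token compute/bandwidth scales with *active*
# params, while the memory footprint scales with *total* params.
# Figures below are illustrative, not benchmarks.

def weights_gib(params: float, bpw: float) -> float:
    return params * bpw / 8 / 1024**3

dense_total = 70e9                    # dense 70B: every param read per token
moe_total, moe_active = 141e9, 39e9   # Mixtral-8x22B-style MoE

print(f"dense ~Q4 footprint: {weights_gib(dense_total, 4.5):.0f} GiB, "
      f"params read per token: {dense_total / 1e9:.0f}B")
print(f"MoE   ~Q4 footprint: {weights_gib(moe_total, 4.5):.0f} GiB, "
      f"params read per token: {moe_active / 1e9:.0f}B")
```

The MoE needs more VRAM overall (fine with 128GB), but each generated token only touches the active experts, so generation runs much faster than a dense model of comparable total size.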
