Why has VRAM nearly doubled from Small 3.1?
#14
by
rdodev
- opened
Folks,
Trying to run this model essentially entails doubling our infrastructure. Small 3.1 fit easily on a single H100 with plenty of headroom. With 3.2 we need to use 2x H100, because the weights alone take >55 GB of VRAM, and the KV cache and memory mapping push it past the 80 GB of a single H100. Both are using the same quant. Why the big difference?
Any info here @juliendenize ?
Hi,
Indeed, there shouldn't be any difference. Could you provide the code snippet you use to serve the model?
No difference on my side. Maybe you didn't set the same max-model-len, which resulted in much more VRAM consumption for the KV cache.
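For context on why max-model-len matters so much: KV cache memory grows linearly with context length. A rough back-of-envelope sketch (the layer/head counts below are illustrative placeholders, not the exact Small 3.2 config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes):
    """Approximate KV cache size for one sequence.

    Factor of 2 accounts for storing both the K and the V tensors
    at every layer, for every cached token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Illustrative numbers (NOT the actual model config):
# 40 layers, 8 grouped-query KV heads, head_dim 128, fp16 cache (2 bytes).
full_context = kv_cache_bytes(40, 8, 128, 131072, 2)   # 128k-token context
short_context = kv_cache_bytes(40, 8, 128, 8192, 2)    # 8k-token context

print(f"128k context: {full_context / 2**30:.1f} GiB per sequence")
print(f"  8k context: {short_context / 2**30:.2f} GiB per sequence")
```

With these assumed dimensions, a single 128k-token sequence needs ~20 GiB of cache while an 8k one needs ~1.25 GiB, so two deployments of the same weights can differ by tens of GB just from the max-model-len setting.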
Apologies, I see the issue was on my end. I had forgotten that I had 3.1 set to "fp8" quant.
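That quant difference alone accounts for the near-doubling: weight memory scales directly with bytes per parameter, so fp8 weights take half the VRAM of bf16/fp16 weights. A quick sketch, assuming a roughly 24B-parameter model:

```python
def weight_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes),
    ignoring smaller buffers like embeddings kept in higher precision."""
    return num_params * bytes_per_param / 1e9


PARAMS = 24e9  # assumed ~24B parameters

print(f"bf16/fp16: {weight_gb(PARAMS, 2):.0f} GB")  # 2 bytes per param
print(f"fp8:       {weight_gb(PARAMS, 1):.0f} GB")  # 1 byte per param
```

48 GB of bf16 weights plus KV cache no longer fits comfortably on one 80 GB H100, whereas the 24 GB fp8 deployment did, which matches the behavior described above.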
rdodev
changed discussion status to
closed