Why has VRAM nearly doubled from Small 3.1?

#14
by rdodev - opened

Folks,

Trying to run this model essentially entails doubling our infrastructure. Small 3.1 fit easily on a single H100 with plenty of headroom. With 3.2 we need 2x H100, because the model itself takes >55 GB of VRAM, and the KV cache and map push it past the 80 GB of a single H100. Both are using the same quant. Why the big difference?
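For context, here's a rough back-of-envelope of where serving memory goes (weights plus KV cache). The parameter count, layer count, and head dimensions below are illustrative placeholders, not the actual Small 3.1/3.2 config:

```python
# Rough serving-memory estimate (illustrative numbers, not the real config).

def weight_gb(n_params_billions: float, bytes_per_param: float) -> float:
    # 1e9 params * bytes/param -> GB
    return n_params_billions * bytes_per_param

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                max_tokens: int, bytes_per_elem: float) -> float:
    # Per token per layer, the cache stores K and V: 2 * n_kv_heads * head_dim
    elems = 2 * n_layers * n_kv_heads * head_dim * max_tokens
    return elems * bytes_per_elem / 1e9

# A 24B-parameter model: ~48 GB of weights in bf16, ~24 GB in fp8.
print(weight_gb(24, 2.0), weight_gb(24, 1.0))          # 48.0 24.0

# Hypothetical: 40 layers, 8 KV heads, head_dim 128, 128k-token cache, bf16.
print(round(kv_cache_gb(40, 8, 128, 131072, 2.0), 1))  # ~21.5 GB
```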

Any info here, @juliendenize?

juliendenize (Mistral AI org)

Hi,

Indeed, there shouldn't be any difference. Could you provide the code snippet you use to serve the model?

There's no difference on my side. Maybe you didn't set the same max-model-len, which would result in much higher VRAM consumption for the KV cache.
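For example, a minimal vLLM setup that pins max-model-len explicitly; this is a sketch assuming the offline `LLM` entry point, and the repo ID and `tokenizer_mode` value are my assumptions about the setup:

```python
from vllm import LLM

# Pinning max_model_len caps how much VRAM vLLM reserves for the KV cache;
# left unset, vLLM sizes the cache for the model's full context window.
llm = LLM(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",  # assumed repo ID
    tokenizer_mode="mistral",       # assumed; matches the Mistral tokenizer
    max_model_len=32768,            # e.g. 32k instead of the full context
    gpu_memory_utilization=0.90,
)
```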

Apologies, I see the issue was on my end. I had forgotten that I had 3.1 set to "fp8" quant.
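For anyone hitting the same thing, the mismatch looked roughly like this (a sketch assuming vLLM's `quantization` option; the repo IDs are illustrative):

```python
from vllm import LLM

# Small 3.1 had been loaded with fp8 weights (~half the VRAM of bf16)...
llm_31 = LLM(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",  # assumed repo ID
    tokenizer_mode="mistral",
    quantization="fp8",
)

# ...while Small 3.2 was loaded in full bf16, hence the apparent doubling.
llm_32 = LLM(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",  # assumed repo ID
    tokenizer_mode="mistral",
)
```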

rdodev changed discussion status to closed
