Why has VRAM nearly doubled from Small 3.1?
#14
by
rdodev
- opened
Folks,
Trying to run this model essentially entails doubling our infrastructure. Small 3.1 fit easily on a single H100 with plenty of headroom. With 3.2 we need to use 2x H100, because the weights alone take >55 GB of VRAM, and the KV cache and memory mapping push it past the 80 GB of a single H100. Both are using the same quant. Why the big difference?
Any info here @juliendenize ?
Hi,
Indeed, there shouldn't be any difference. Could you provide the code snippet you use to serve the model?
No difference on my side. Maybe you didn't set the same max-model-len, which resulted in much more VRAM consumption for the KV cache.
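For context on why max-model-len matters so much: KV cache memory grows linearly with context length. A rough back-of-envelope sketch (the layer/head counts below are illustrative placeholders, not the exact Small 3.2 config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes):
    """Approximate KV cache size for one sequence.

    Factor of 2 accounts for storing both the K and the V tensors
    at every layer, for every cached token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Illustrative numbers (NOT the actual model config):
# 40 layers, 8 grouped-query KV heads, head_dim 128, fp16 cache (2 bytes).
full_context = kv_cache_bytes(40, 8, 128, 131072, 2)   # 128k-token context
short_context = kv_cache_bytes(40, 8, 128, 8192, 2)    # 8k-token context

print(f"128k context: {full_context / 2**30:.1f} GiB per sequence")
print(f"  8k context: {short_context / 2**30:.2f} GiB per sequence")
```

With these assumed dimensions, a single 128k-token sequence needs ~20 GiB of cache while an 8k one needs ~1.25 GiB, so two deployments of the same weights can differ by tens of GB just from the max-model-len setting.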
Apologies, I see the issue was on my end. I had forgotten that I had 3.1 set to "fp8" quant.
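That quant difference alone accounts for the near-doubling: weight memory scales directly with bytes per parameter, so fp8 weights take half the VRAM of bf16/fp16 weights. A quick sketch, assuming a roughly 24B-parameter model:

```python
def weight_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes),
    ignoring smaller buffers like embeddings kept in higher precision."""
    return num_params * bytes_per_param / 1e9


PARAMS = 24e9  # assumed ~24B parameters

print(f"bf16/fp16: {weight_gb(PARAMS, 2):.0f} GB")  # 2 bytes per param
print(f"fp8:       {weight_gb(PARAMS, 1):.0f} GB")  # 1 byte per param
```

48 GB of bf16 weights plus KV cache no longer fits comfortably on one 80 GB H100, whereas the 24 GB fp8 deployment did, which matches the behavior described above.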
rdodev
changed discussion status to
closed