Too much VRAM in vLLM

#75
by cbrug - opened

I'm trying to deploy the Gemma model using 4 A100 (40GB) GPUs.
This should be more than enough for the model, but it goes OOM while preparing the engine.

This is the output for a single GPU (the other 3 report more or less the same).

the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.90) = 35.44GiB
model weights take 13.17GiB; non_torch_memory takes 2.09GiB; PyTorch activation peak memory takes 17.91GiB; the rest of the memory reserved for KV Cache is 2.28GiB.
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (19264).

I don't understand why so much memory gets used; there should be more than enough left to run the model at its full max length.
The likely culprit is the PyTorch activation peak memory of ~18 GiB, which seems unusually high. Any advice?
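For reference, here is how the numbers in the log add up (a quick sanity check of the accounting; all values are copied from the output above, nothing is measured):

```python
# Sanity check of the memory accounting reported above.
# All numbers are copied from the vLLM log; nothing here touches the GPU.
total_gpu_memory = 39.38          # GiB per A100 40GB, as reported
gpu_memory_utilization = 0.90     # vLLM's default budget fraction

budget = total_gpu_memory * gpu_memory_utilization      # ~35.44 GiB
weights = 13.17                   # GiB of model weights per GPU (TP=4)
non_torch = 2.09                  # GiB of non-torch memory
activation_peak = 17.91           # GiB measured during the profiling run

kv_cache = budget - weights - non_torch - activation_peak
print(f"Left for KV cache: {kv_cache:.2f} GiB")          # ~2.27 GiB
# That small KV cache only fits 19264 tokens, well below the requested
# max seq len of 131072, hence the ValueError.
```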

Libraries

accelerate                               1.7.0
torch                                    2.7.0
torchaudio                               2.7.0
torchvision                              0.22.0
transformers                             4.52.4
vllm                                     0.9.1

Change the context: instead of 131072, set 4096 for a test.

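For example, with the offline Python API (the model id below is a placeholder, and tensor_parallel_size=4 is assumed from the 4-GPU setup described above; the equivalent server flag is --max-model-len):

```python
from vllm import LLM

# Cap the context length so the profiling run and the KV cache fit
# inside the 90% memory budget.
llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder; use the actual Gemma checkpoint
    tensor_parallel_size=4,          # 4x A100 40GB as in the original post
    gpu_memory_utilization=0.90,     # vLLM default, matches the log
    max_model_len=4096,              # reduced from the model's 131072 default
)
```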

OK, reducing the context does reduce the activation memory.

model weights take 13.17GiB; non_torch_memory takes 1.95GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 18.92GiB.

So the only solution is to not use it at its full potential? Seems odd.

Set --max-num-seqs to a value below 8.
4 is good.
1 is even better for memory usage.
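A minimal sketch combining both suggestions, under the same assumptions as above (placeholder model id, 4-GPU tensor parallelism):

```python
from vllm import LLM

# Shorter context plus a smaller concurrent-batch limit.
# max_num_seqs caps how many sequences are scheduled together; the
# suggestion above is that 4 (or even 1) reduces peak memory usage.
llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder; use the actual checkpoint
    tensor_parallel_size=4,
    max_model_len=4096,
    max_num_seqs=4,                  # "--max-num-seqs 4" when using `vllm serve`
)
```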
