Too much VRAM in vLLM

#75
by cbrug - opened

I'm trying to deploy the Gemma model using 4 A100 (40GB) GPUs.
This should be more than enough for the model, but it goes OOM while preparing the engine.

This is the output for a single GPU (the other 3 report more or less the same).

the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.90) = 35.44GiB
model weights take 13.17GiB; non_torch_memory takes 2.09GiB; PyTorch activation peak memory takes 17.91GiB; the rest of the memory reserved for KV Cache is 2.28GiB.
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (19264).

I don't understand why so much memory gets used; there should be more than enough left to run the model at its full max length.
The likely culprit is the PyTorch activation peak memory of ~18 GiB, which seems unusually high. Any advice?
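For reference, here is how the numbers in the log add up (a quick sanity check of the accounting; all values are copied from the output above, nothing is measured):

```python
# Sanity check of the memory accounting reported above.
# All numbers are copied from the vLLM log; nothing here touches the GPU.
total_gpu_memory = 39.38          # GiB per A100 40GB, as reported
gpu_memory_utilization = 0.90     # vLLM's default budget fraction

budget = total_gpu_memory * gpu_memory_utilization      # ~35.44 GiB
weights = 13.17                   # GiB of model weights per GPU (TP=4)
non_torch = 2.09                  # GiB of non-torch memory
activation_peak = 17.91           # GiB measured during the profiling run

kv_cache = budget - weights - non_torch - activation_peak
print(f"Left for KV cache: {kv_cache:.2f} GiB")          # ~2.27 GiB
# That small KV cache only fits 19264 tokens, well below the requested
# max seq len of 131072, hence the ValueError.
```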

Libraries

accelerate                               1.7.0
torch                                    2.7.0
torchaudio                               2.7.0
torchvision                              0.22.0
transformers                             4.52.4
vllm                                     0.9.1

Change the context: instead of 131072, set 4096 for a test.

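For example, with the offline Python API (the model id below is a placeholder, and tensor_parallel_size=4 is assumed from the 4-GPU setup described above; the equivalent server flag is --max-model-len):

```python
from vllm import LLM

# Cap the context length so the profiling run and the KV cache fit
# inside the 90% memory budget.
llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder; use the actual Gemma checkpoint
    tensor_parallel_size=4,          # 4x A100 40GB as in the original post
    gpu_memory_utilization=0.90,     # vLLM default, matches the log
    max_model_len=4096,              # reduced from the model's 131072 default
)
```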

OK, reducing the context does reduce the activation memory.

model weights take 13.17GiB; non_torch_memory takes 1.95GiB; PyTorch activation peak memory takes 1.41GiB; the rest of the memory reserved for KV Cache is 18.92GiB.

So the only solution is to not use it at its full potential? Seems odd.

Set --max-num-seqs to a value below 8.
4 is good.
1 is even better for memory usage.
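A minimal sketch combining both suggestions, under the same assumptions as above (placeholder model id, 4-GPU tensor parallelism):

```python
from vllm import LLM

# Shorter context plus a smaller concurrent-batch limit.
# max_num_seqs caps how many sequences are scheduled together; the
# suggestion above is that 4 (or even 1) reduces peak memory usage.
llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder; use the actual checkpoint
    tensor_parallel_size=4,
    max_model_len=4096,
    max_num_seqs=4,                  # "--max-num-seqs 4" when using `vllm serve`
)
```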
