Can it work on a single 3090?

#7
by hamaadtahiir - opened

What are the VRAM requirements, and can I run it on a single 3090 with vLLM?

It can be made to fit, but it's right at the limit - you'll need to do basically everything you can to reduce memory usage.

I'm running on an A10G (24GB) with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and max_model_len=4800, enforce_eager=True, gpu_memory_utilization=0.98, kv_cache_dtype="fp8". This works but isn't practical for production use - it's slow, prone to OOM, and you have to really scrimp with your tokens.
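A minimal sketch of that setup, assuming vLLM's offline `LLM` API (the model ID below is a placeholder for whatever repo this discussion is attached to, and the allocator env var has to be set before anything touches the GPU):

```python
import os

# Set before importing torch/vLLM so the allocator config takes effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-name",        # placeholder - use the actual model repo
    max_model_len=4800,            # small context window to shrink the KV cache
    enforce_eager=True,            # skip CUDA graph capture to save memory
    gpu_memory_utilization=0.98,   # claim nearly all of the 24 GB
    kv_cache_dtype="fp8",          # roughly halves KV cache memory vs. fp16
)

outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

With these settings the KV cache and context budget are the main levers; pushing max_model_len higher is usually what tips it back into OOM.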
