can it work on single 3090?
#7 opened by hamaadtahiir
What are the VRAM requirements, and can I run it on a single 3090 with vLLM?
It can be made to fit, but it's right at the limit - you'll need to do basically everything you can to reduce memory usage.
I'm running on an A10G (24GB) with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` and `max_model_len=4800`, `enforce_eager=True`, `gpu_memory_utilization=0.98`, `kv_cache_dtype="fp8"`. This works, but it isn't practical for production use: it's slow, prone to OOM, and you have to really scrimp on tokens.
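For reference, here is a minimal sketch of how those settings fit together in a vLLM script, assuming a 24GB card. The model ID is a placeholder for whatever model this thread is about, and the exact numbers (context length, memory utilization) are the ones quoted above, not tuned recommendations:

```python
import os

# Must be set before CUDA is initialized; lets the allocator grow segments
# instead of reserving fixed blocks, which helps avoid fragmentation OOMs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM, SamplingParams

llm = LLM(
    model="<model-repo-id>",      # placeholder: the model discussed in this thread
    max_model_len=4800,           # small context window to shrink the KV cache
    enforce_eager=True,           # skip CUDA graph capture to save memory
    gpu_memory_utilization=0.98,  # hand nearly all VRAM to vLLM
    kv_cache_dtype="fp8",         # 8-bit KV cache roughly halves cache memory
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

Each of these knobs trades speed or capacity for memory, so expect lower throughput and a hard ceiling on prompt plus output length compared to a larger GPU.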