Inference on RTX 4090 24GB: I can't get it to run
#46
by
abbas381366
- opened
Hi.
The model card says this model can be run on a single RTX 4090, but I can't do it via vLLM.
Please help me figure out whether it can be done.
As I understand it, the model card means the model fits on a single RTX 4090 once it is quantized, so you can try this one, for example: https://huggingface.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF/blob/main/mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf.
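For reference, here is a minimal sketch of loading that Q4_K_M GGUF with llama-cpp-python and full GPU offload; the local file path and context size are assumptions, so adjust them to your setup:

```python
# Minimal sketch: load the Q4_K_M GGUF with llama-cpp-python and offload all layers to the GPU.
# The model_path and n_ctx values below are assumptions; change them to match your download and needs.
from llama_cpp import Llama

llm = Llama(
    model_path="mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf",  # file from the link above
    n_gpu_layers=-1,  # offload every layer to the RTX 4090
    n_ctx=8192,       # keep the context modest so the KV cache also fits in 24 GB
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Does this model fit on a single RTX 4090?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```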
Thanks, I tested it. But with quantization almost any model can fit on an RTX 4090 24GB :D
I can even run Llama 3.1 70B on an RTX 4090 ...
The model card misled me :D
abbas381366
changed discussion status to
closed