Inference on RTX 4090 24GB: I can't get it to run
#46
by
abbas381366
- opened
Hi.
The model card says this model can be run on a single RTX 4090, but I can't do it via vLLM.
Please help me figure out whether it can be done.
As I understand it, the model card means the model fits on a single RTX 4090 once it is quantized, so you can try this one, for example: https://huggingface.co/bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF/blob/main/mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf.
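For reference, here is a minimal sketch of loading that Q4_K_M GGUF with llama-cpp-python and full GPU offload; the local file path and context size are assumptions, so adjust them to your setup:

```python
# Minimal sketch: load the Q4_K_M GGUF with llama-cpp-python and offload all layers to the GPU.
# The model_path and n_ctx values below are assumptions; change them to match your download and needs.
from llama_cpp import Llama

llm = Llama(
    model_path="mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf",  # file from the link above
    n_gpu_layers=-1,  # offload every layer to the RTX 4090
    n_ctx=8192,       # keep the context modest so the KV cache also fits in 24 GB
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Does this model fit on a single RTX 4090?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```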
Thanks, I tested it. But with quantization almost any model can fit on an RTX 4090 24GB :D
I can even run Llama 3.1 70B on an RTX 4090 ...
The model card misled me :D
abbas381366
changed discussion status to
closed