How do I serve this version of the model with vLLM?
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model mistral-small-24b-bnb-4bit --max_model_len=20000 --port 8080 --quantization bitsandbytes --load-format bitsandbytes --tokenizer_mode mistral --config_format mistral --tool-call-parser mistral --enable-auto-tool-choice
This command doesn't work.
I have the same question. I would like to know if this is possible or if we have to wait for a vLLM update.
My bad, wrong answer...
In the meantime, you can use the standard BnB one: https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-bnb-4bit
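For that checkpoint, something like the following might work (untested sketch; it assumes the repo ships Hugging Face-format config and tokenizer files, so the --tokenizer_mode mistral and --config_format mistral flags are dropped, and it reuses the same bitsandbytes quantization/load flags from the command above):

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model unsloth/Mistral-Small-3.1-24B-Instruct-2503-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes --max-model-len 20000 --port 8080

If tool calling is needed, the --tool-call-parser mistral --enable-auto-tool-choice flags can be appended as in the original command, but I haven't verified they work with this BnB checkpoint.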
I have the same issue; I don't know which arguments to use to serve this model... Anyone? Thanks!