How do I serve this model with vLLM?

#1 opened by couldn

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model mistral-small-24b-bnb-4bit --max_model_len=20000 --port 8080 --quantization bitsandbytes --load-format bitsandbytes --tokenizer_mode mistral --config_format mistral --tool-call-parser mistral --enable-auto-tool-choice
This command does not work.

I have the same question. I would like to know if this is possible, or if we have to wait for an update of vLLM.

My bad, wrong answer...

Unsloth AI org


In the meantime, you can use the standard BnB one: https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-bnb-4bit
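
For reference, a minimal sketch of what serving that standard BnB repo might look like, reusing the flags from the command at the top of the thread. The --tokenizer_mode mistral and --config_format mistral flags are dropped here on the assumption that the BnB repo ships Hugging Face-format config and tokenizer files rather than Mistral-native ones; double-check this against the repo's files and your vLLM version before relying on it:

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model unsloth/Mistral-Small-3.1-24B-Instruct-2503-bnb-4bit \
    --max_model_len=20000 \
    --port 8080 \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --tool-call-parser mistral \
    --enable-auto-tool-choice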

I have the same issue; I don't know the arguments to serve this model... Anyone? Thanks!
