How to run in vLLM
Can you please update the instructions on how to run this quantized model in vLLM?
Thanks!
try using sglang, it has a vllm backend
python3 -m sglang.launch_server \
    --served-model-name tonjoo-coder \
    --model-path unsloth/Devstral-Small-2505-bnb-4bit \
    --chat-template /models/devstral.jinja \
    --port 8000 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8
Actually I tried tp=2 but it's not working; tp=1 should work.
You should be able to do it fine
vllm serve unsloth/Devstral-Small-2505-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes
see https://docs.vllm.ai/en/latest/features/quantization/bnb.html
That worked, thanks!
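For anyone following along: once either server is up (vllm serve or sglang.launch_server), it exposes an OpenAI-compatible chat endpoint on port 8000. A minimal client sketch, assuming the default localhost:8000 endpoint; the helper function name is illustrative, not part of either project's API:

```python
import json

# Builds the JSON body for the OpenAI-compatible /v1/chat/completions
# endpoint that both vLLM and sglang expose. The model name should match
# what was passed to `vllm serve` (or --served-model-name for sglang).
def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_chat_request(
    "unsloth/Devstral-Small-2505-bnb-4bit",
    "Write a Python function that reverses a string.",
)
print(json.dumps(body, indent=2))

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(body).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

With the sglang command above, the model name in the request would be tonjoo-coder instead, since that is what --served-model-name sets.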
Hi @todiadiyatmo, what is your chat-template devstral.jinja?
It's in the Unsloth guide: https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/devstral-how-to-run-and-fine-tune#tutorial-how-to-run-devstral-in-ollama. The guide has been updated though; if I'm not mistaken, the file is this:
https://huggingface.co/unsloth/Devstral-Small-2505-GGUF/blob/main/template