Serving sarvamai/sarvam-m with vLLM: Mistral tokenizer error at launch, which params, and how to keep tool calling?
Hi everyone,
I’m trying to serve sarvamai/sarvam-m with the latest dev build of vLLM (0.9.1, CUDA back-end).
Launch command:
python -m vllm.entrypoints.openai.api_server \
--model sarvamai/sarvam-m \
--tokenizer_mode mistral \
--load_format mistral \
--tensor_parallel_size 4 \
--enable_auto_tool_choice \
--tool_call_parser mistral \
--enable_prefix_caching \
--port 8080 \
--trust_remote_code
Log tail:
Traceback (most recent call last):
...
OSError: Found 0 files matching the pattern: ^tokenizer\.model\.v.*$|^tekken\.json$|^tokenizer\.mm\.model\.v.*$
Make sure that a Mistral tokenizer is present in
['tokenizer.json', 'tokenizer_config.json', ...]
What I think is happening
MistralTokenizer inside vLLM only recognises the official Mistral files (tekken.json, tokenizer.model.v*, etc.).
The Sarvam repo ships just the generic Hugging Face tokenizer.json, so the scan matches nothing and the loader raises the OSError.
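To sanity-check that, this one-liner (a sketch, assuming the repo is public and huggingface_hub is installed locally) lists the tokenizer-related files the Hub actually serves for the repo:

python -c "from huggingface_hub import list_repo_files; print([f for f in list_repo_files('sarvamai/sarvam-m') if 'token' in f or 'tekken' in f])"

If nothing matching the patterns from the error (tekken.json, tokenizer.model.v*) shows up, that would confirm the tokenizer-file explanation.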
Questions
1. Is the quick fix simply adding
--tokenizer mistralai/Mistral-Small-24B-Instruct-2501
so vLLM grabs the tokenizer from the original base model while still loading Sarvam's weights? (Full command sketch after this list.)
2. Switching --tokenizer_mode to auto makes vLLM fall back to the Hugging Face fast tokenizer. Generation works, but the Mistral-3-specific extras (notably the function-calling / tool-calling special-token prefixes) are lost, so tool calls are no longer parsed. Is there a way to keep tool calling working in that mode?
3. Any other launch parameters I should tweak for a 4-GPU (H100 80 GB each) setup?
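For question 1, this is the variant I have in mind (just a sketch: the same command as above with --tokenizer added; I haven't confirmed that pairing Sarvam's weights with the base model's tokenizer is actually supported):

python -m vllm.entrypoints.openai.api_server \
--model sarvamai/sarvam-m \
--tokenizer mistralai/Mistral-Small-24B-Instruct-2501 \
--tokenizer_mode mistral \
--load_format mistral \
--tensor_parallel_size 4 \
--enable_auto_tool_choice \
--tool_call_parser mistral \
--enable_prefix_caching \
--port 8080 \
--trust_remote_code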
Thanks for any pointers!
— Aditya