How do I serve the model in the original folder as bf16 with vLLM?

#60 by bakch92

I'm currently using an A100 GPU, which doesn't support MXFP4, so I'm trying to serve the model in bf16. I think there's an unquantized checkpoint in the original folder. How do I serve it with vLLM?

I'm also trying to serve it on A100s with vLLM. I was considering llama.cpp as an alternative.

Here's an example of using vLLM to serve the model on an H100 GPU cluster: http://playground.tracto.ai/playground?pr=notebooks/bulk-inference-gpt-oss-120b
Feel free to look at or modify the code.

You might want to look at OpenAI's official cookbook on running gpt-oss with vLLM: https://cookbook.openai.com/articles/gpt-oss/run-vllm and the official vLLM recipe as well: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html
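For the bf16 part specifically, the piece you'd add on top of those recipes is vLLM's --dtype flag. Here's a minimal sketch, untested on A100, that assumes the unquantized checkpoint under the original/ subfolder is in a layout vLLM can load directly (if it isn't, you may need to convert it first, or point vLLM at the main repo instead):

# Download only the unquantized weights from the original/ subfolder
# (the folder layout and loadability in vLLM are assumptions, not verified)
huggingface-cli download openai/gpt-oss-20b \
    --include "original/*" \
    --local-dir ./gpt-oss-20b

# Serve the local checkpoint in bf16 on the default OpenAI-compatible endpoint
vllm serve ./gpt-oss-20b/original \
    --dtype bfloat16 \
    --port 8000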

AFAICT it should automatically work with:

docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b
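
On an A100 you may need to force the dtype explicitly, since the MXFP4 kernels aren't available on pre-Hopper GPUs. A variant of the same command with --dtype bfloat16 added (again a sketch; I haven't verified that the gptoss image falls back cleanly on A100):

docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b \
    --dtype bfloat16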
