How do I serve the model in the original folder as bf16 with vLLM?

#60 by bakch92

I'm currently using an A100 GPU, which doesn't support MXFP4, so I'm trying to serve the model in bf16. I think there's an unquantized checkpoint in the original folder. How do I serve it with vLLM?

I'm also trying to serve it on A100s with vLLM. I was considering llama.cpp as an alternative.

Here's an example of using vLLM to serve the model on an H100 GPU cluster: http://playground.tracto.ai/playground?pr=notebooks/bulk-inference-gpt-oss-120b
Feel free to look at or modify the code.

You might want to look at OpenAI's official cookbook on running gpt-oss with vLLM: https://cookbook.openai.com/articles/gpt-oss/run-vllm and the official vLLM recipe as well: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html
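For the bf16 part specifically, the piece you'd add on top of those recipes is vLLM's --dtype flag. Here's a minimal sketch, untested on A100, that assumes the unquantized checkpoint under the original/ subfolder is in a layout vLLM can load directly (if it isn't, you may need to convert it first, or point vLLM at the main repo instead):

# Download only the unquantized weights from the original/ subfolder
# (the folder layout and loadability in vLLM are assumptions, not verified)
huggingface-cli download openai/gpt-oss-20b \
    --include "original/*" \
    --local-dir ./gpt-oss-20b

# Serve the local checkpoint in bf16 on the default OpenAI-compatible endpoint
vllm serve ./gpt-oss-20b/original \
    --dtype bfloat16 \
    --port 8000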

AFAICT it should automatically work with:

docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b
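
On an A100 you may need to force the dtype explicitly, since the MXFP4 kernels aren't available on pre-Hopper GPUs. A variant of the same command with --dtype bfloat16 added (again a sketch; I haven't verified that the gptoss image falls back cleanly on A100):

docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b \
    --dtype bfloat16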
