Loaded successfully but no response from vLLM

by Fernanda24 - opened

I loaded Air-FP8 in vLLM. The server starts and logs:

```
(APIServer pid=945236) INFO 08-03 01:16:48 [api_server.py:1846] Starting vLLM API server 0 on http://0.0.0.0:8001
...
[chat_utils.py:468] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=945236) INFO: 127.0.0.1:38000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```

but it never sends a response back to Open WebUI. It seems to receive the request, yet nothing comes back. In Open WebUI all I see is:

```
SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
```
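One way to narrow this down is to hit the endpoint directly and bypass Open WebUI entirely. A minimal sketch, assuming the server from the logs above on port 8001 and the served model name `GLM-4.5-Air` used in the commands further down:

```bash
# Query vLLM's OpenAI-compatible endpoint directly, bypassing Open WebUI.
# Assumes port 8001 and --served-model-name GLM-4.5-Air (see commands below).
curl -s http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5-Air",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'
# If this hangs or returns an empty body, the problem is on the vLLM side;
# if it returns valid JSON, the issue is between Open WebUI and the server.
```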

OK, so someone gave me a tip about pipeline parallelism and it worked! This command works:

```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 \
vllm serve /mnt/2king/llama-models/zai-org/GLM-4.5-Air-FP8 \
  --gpu-memory-utilization 0.94 \
  --port 8001 \
  --served-model-name GLM-4.5-Air \
  --kv-cache-dtype fp8 \
  --pipeline-parallel-size 3 \
  --tensor-parallel-size 1
```

But the command below loads the model and then doesn't work. It gives no error, just gets stuck and never sends a response:

```bash
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 \
vllm serve /mnt/2king/llama-models/zai-org/GLM-4.5-Air-FP8 \
  --gpu-memory-utilization 0.94 \
  --port 8001 \
  --served-model-name GLM-4.5-Air \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2
```

The difference: the first example, which now works, uses `--pipeline-parallel-size 3 --tensor-parallel-size 1`, while the second, which loads without error but still never responds, uses `--tensor-parallel-size 2`.
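For anyone who still wants tensor parallelism: a silent hang with `--tensor-parallel-size` greater than 1, while pipeline parallelism works on the same GPUs, is often a GPU-to-GPU communication problem rather than a model issue. A hedged diagnostic, assuming an NCCL-based setup (`NCCL_P2P_DISABLE` is a standard NCCL environment variable, not vLLM-specific):

```bash
# Force NCCL to route traffic through host memory instead of GPU peer-to-peer.
# If the TP=2 command starts responding with this set, the hang points at a
# broken P2P/NVLink path between the GPUs (inspect with: nvidia-smi topo -m).
NCCL_P2P_DISABLE=1 \
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 \
vllm serve /mnt/2king/llama-models/zai-org/GLM-4.5-Air-FP8 \
  --gpu-memory-utilization 0.94 \
  --port 8001 \
  --served-model-name GLM-4.5-Air \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2
```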

Fernanda24 changed discussion status to closed
