running in vllm gives error

#1
by GrigoriiA - opened

Did you actually run it in vLLM? It requires dtype=float16, and even then it won't run; it gives an assertion error about the quantization method, which I think means the quantization isn't supported for this model in vLLM yet. My vLLM version is 0.8.5.
If you did run it, which parameters did you use?
Thanks.
This is the end of the error:

```
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 123, in __init__
[rank0]:     self.experts = FusedMoE(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 499, in __init__
[rank0]:     assert self.quant_method is not None
[rank0]: AssertionError
```

Got it working. If anyone else runs into this problem: the "quantization" parameter should be "awq_marlin", not "awq".
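For anyone hitting the same assertion with the offline API rather than a serving container, here is a minimal sketch of the same fix using vLLM's Python LLM class (the prompt and sampling settings are placeholders, not from this thread):

```python
# Minimal sketch: explicitly selecting the awq_marlin quantization kernel.
# The prompt and sampling parameters below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="adamo1139/DeepSeek-R1-0528-AWQ",
    quantization="awq_marlin",  # "awq" was the setting that triggered the assertion above
    dtype="float16",
    tensor_parallel_size=4,     # 4x H200 in the setup described below
)

outputs = llm.generate(
    ["Explain AWQ quantization in one sentence."],
    SamplingParams(temperature=0.6, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```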

Hi.

Yes, I did run it in vLLM 0.9.0.1 as well as 0.8.5 on 8x H100, fresh vLLM install on fresh Ubuntu 22.04. Simple command vllm serve adamo1139/DeepSeek-R1-0528-AWQ --tensor-parallel 8 was enough to make it work as vLLM figures out on it's own to use the awq_marlin kernel presumably also the right dtype. For what it's worth, it loads in fine for me with both --dtype float16 and --dtype bfloat16 What GPUs were you using?
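(To illustrate the auto-detection point, this is a rough offline-API equivalent of that serve command; it is only a sketch, assuming that leaving quantization and dtype unset lets vLLM pick them up from the checkpoint config as described above.)

```python
# Sketch: quantization and dtype are intentionally left unset so vLLM
# detects awq_marlin and the dtype from the model's config on its own.
from vllm import LLM

llm = LLM(
    model="adamo1139/DeepSeek-R1-0528-AWQ",
    tensor_parallel_size=8,  # matches the 8x H100 setup above
)
```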

I used 4x H200. That's enough memory-wise.
vLLM v0.8.5, tensor_parallel=4, dtype=float16, quantization=awq_marlin. With these parameters it works.
I tried it on runpod.io's serverless, but it makes little sense to use it there, at least not with network volumes, because the load time is more than a minute.

I'm not able to replicate that: when running vLLM 0.8.5 (vllm serve) on 4x H200 (vast.ai) with tensor parallel 2 and awq_marlin quantization, I get an OOM. With --tensor-parallel 4 it works. Are you using it with offline inference or vllm serve? If it's offline inference, can you share the relevant code snippet?

I'm sorry, I noticed and corrected my typo. Tensor parallel was 4, of course.
As I stated in my second message, I got it working. The setup was 4x H200 on runpod.io, using runpod's vLLM Docker container with vLLM 0.8.5, --tensor-parallel 4, and awq_marlin.
That setup didn't work with quantization set to awq, and that was my problem. I changed it to awq_marlin, and it worked.
Sorry for any confusion.

I got a bit confused too and forgot that awq_marlin was the focus of the issue. I updated the readme.

adamo1139 changed discussion status to closed
