vllm

#4
by NikolaSigmoid - opened

I launched vLLM (8 x H100 SXM) using the following command to start the server:

vllm serve --model OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview --tensor-parallel-size 8 --max-model-len 16384 --max_num_seqs 1 --trust-remote-code

Then, I made a simple “Hello” request like this:

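from openai import OpenAI

# Assumption: the server runs locally on vLLM's default port (8000);
# the API key is a placeholder, since none was configured on the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_completion = client.chat.completions.create(
    model="OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview",
    messages=[
        {
            "role": "user",
            "content": "Hello",
        },
    ],
    stream=True,
)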

However, the output I received was quite strange, and I’m unsure why this is happening. Any insights into what might be causing this anomaly would be greatly appreciated.

AAAA隨後ffi®AAAAAAAAÄAAAAAAAAICAgICAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAflaaaaaaaaAAAAAh–
AAAAAAAAAAAAепflaaaaAAAA–
 AAA mootÎ AAA AAAÎ巴克ÎiaisÎðStringÍABCDíð®AAA GuineaPokAAAAifiÎffffffialiíí @)]DotíÎ84ok安置куп Î HSÎÎ708414 followers Macrom Pembíð Including756HKiOS®157Indonesia Ronald HN仅有 ®ízi ÎIEEE
...
Open Platform for Enterprise AI org

I will run a test on the CPU side. Could you also test the verified prompts from our README?
We don't have enough CUDA resources, so we could not test it on the CUDA side.

Alright, I’ll test loading the model using Transformers, but it seems quite slow. How long did it take you to load this model with Transformers?

Open Platform for Enterprise AI org

It takes over 20 minutes. I suggest testing the validated prompts in vLLM first. If the results aren't satisfactory, it might be better to try them in Transformers. If neither approach works, I suspect an overflow issue has occurred, similar to what we encountered with Qwen2.5-32B earlier (https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc). In that case, we will try to quantize the model with mixed precision.

Could you please explain how you managed to quantize this model without sufficient GPU resources?

Oh no, I waited a long time and then got this error!

Screenshot 2025-01-02 at 16.55.45.png

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 7 has a total capacity of 79.10 GiB of which 7.88 MiB is free. Process 222887 has 79.08 GiB memory in use. Of the allocated memory 56.03 GiB is allocated by PyTorch, and 22.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Screenshot 2025-01-02 at 17.07.20.png

Here is the result I obtained after launching with vLLM and testing your example prompt.

Open Platform for Enterprise AI org

Could you please explain how you managed to quantize this model without sufficient GPU resources?
We have not tested it on GPU; we only tested it on CPU.

Open Platform for Enterprise AI org

Screenshot 2025-01-02 at 17.07.20.png

Here is the result I obtained after launching with vLLM and testing your example prompt.

Thank you for the information. We will verify if the input or output of certain layers of this model has exceeded the FP16 range.

Open Platform for Enterprise AI org

I ran a quick test:

model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
model.layers.59.mlp.experts.138.down_proj tensor(1096.) tensor(130.5271)
model.layers.60.mlp.experts.81.down_proj tensor(6016.) tensor(8290.2236)
model.layers.60.mlp.experts.92.down_proj tensor(111616.) tensor(52362.3281)
model.layers.59.mlp.experts.138.down_proj tensor(1056.) tensor(125.2802)
model.layers.60.mlp.experts.81.down_proj tensor(5184.) tensor(7294.0933)
model.layers.60.mlp.experts.92.down_proj tensor(108032.) tensor(51036.6992)
model.layers.60.mlp.experts.81.down_proj tensor(4352.) tensor(6245.4785)
model.layers.60.mlp.experts.92.down_proj tensor(101888.) tensor(48230.1445)
model.layers.59.mlp.experts.138.down_proj tensor(1064.) tensor(124.9290)
model.layers.60.mlp.experts.81.down_proj tensor(5920.) tensor(8268.7275)
model.layers.60.mlp.experts.92.down_proj tensor(110592.) tensor(52426.2188)
model.layers.60.mlp.experts.81.down_proj tensor(5472.) tensor(7656.2739)
model.layers.60.mlp.experts.81.down_proj tensor(5472.) tensor(7656.2739)
model.layers.60.mlp.experts.92.down_proj tensor(107008.) tensor(50818.8711)
model.layers.60.mlp.experts.81.down_proj tensor(5760.) tensor(7966.4805)
model.layers.60.mlp.experts.92.down_proj tensor(107008.) tensor(51374.0078)
model.layers.60.mlp.experts.81.down_proj tensor(6688.) tensor(9049.8135)
model.layers.60.mlp.experts.92.down_proj tensor(117760.) tensor(55190.7734)

Maybe we need to exclude every down_proj in the last layer from quantization.
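
For readers following the quick test: FP16 can represent finite values only up to 65504, so any logged value above that would overflow if the tensor were kept in FP16. The sketch below parses a few of the logged lines and flags the modules that exceed that limit. It assumes each line is a module name followed by two observed magnitudes (the thread does not say what the two tensors measure), and `find_fp16_overflows` is a hypothetical helper name, not part of any library.

```python
import re

FP16_MAX = 65504.0  # largest finite float16 value

# A few lines copied from the quick test output in this thread.
LOG = """\
model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
"""

# Module name, then two "tensor(<number>)" fields.
LINE_RE = re.compile(r"^(\S+)\s+tensor\(([-\d.eE+]+)\)\s+tensor\(([-\d.eE+]+)\)$")


def find_fp16_overflows(log_text):
    """Return {module_name: largest logged value} for modules above FP16_MAX."""
    overflows = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a log line
        name = match.group(1)
        peak = max(float(match.group(2)), float(match.group(3)))
        if peak > FP16_MAX:
            overflows[name] = max(peak, overflows.get(name, 0.0))
    return overflows


overflows = find_fp16_overflows(LOG)
for name, peak in sorted(overflows.items()):
    print(f"{name}: {peak:.0f} exceeds FP16 max {FP16_MAX:.0f}")
```

On this sample, only model.layers.60.mlp.experts.92.down_proj crosses the limit, which is consistent with the suggestion to treat the last layer's down_proj specially.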

Thank you so much! Could you please prioritize this? I’m eagerly anticipating the INT4 version of this model for deployment!

Open Platform for Enterprise AI org

Working on it.

I get this error when starting vllm, can you help me?

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-133701.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False

Your command?

vllm serve --max-model-len 16384 --max_num_seqs 1 --trust_remote_code --tensor-parallel-size 4 OPEA/DeepSeek-V3-int4-sym-inc-cpu

Open Platform for Enterprise AI org

@NikolaSigmoid We have added a workaround for the overflow issue. Please try the latest model. We have validated it with Transformers, but we don't have enough resources to validate it on vLLM.

Screenshot 2025-01-05 at 15.18.55.png
Still not working =))

@NikolaSigmoid I'm also working with vLLM. Maybe you could help me figure some things out :) I only have access to A100s: 9 nodes with 2 GPUs each (so 18 A100s with 80 GB each, 1440 GB in total). I was trying to serve a bf16 version I found here on HF, but I am getting CUDA OOM, even though 685B params should be somewhere around 1350 GB plus some overhead. I am also trying to offload to CPU, but that is not working either.

vllm serve opensourcerelease/DeepSeek-V3-bf16 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 5000 \
    --gpu-memory-utilization 0.7 \
    --cpu-offload-gb 540 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 9 \
    --trust-remote-code

any thoughts? pls help :(
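
A rough budget check for the numbers in this post may help frame the OOM. The sketch below only does arithmetic on values quoted above (685B parameters, bf16, 18 GPUs of 80 GB, --gpu-memory-utilization 0.7). It is an assumption-laden estimate: it counts weights only, takes 80 GB per GPU at face value, and does not model how --cpu-offload-gb is applied or how layers are split across pipeline stages.

```python
# Back-of-the-envelope memory budget for the vllm serve command in this post.
# Assumptions: weights only (no KV cache, activations, or framework overhead),
# and 80 GB per A100 taken at face value.
PARAMS = 685e9                  # parameter count quoted in the post
BYTES_PER_PARAM = 2             # bf16 uses 2 bytes per parameter
NUM_GPUS = 2 * 9                # tensor-parallel-size * pipeline-parallel-size
GPU_GB = 80                     # nominal memory per GPU
GPU_MEMORY_UTILIZATION = 0.7    # value passed to --gpu-memory-utilization

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
gpu_budget_gb = NUM_GPUS * GPU_GB * GPU_MEMORY_UTILIZATION
shortfall_gb = weights_gb - gpu_budget_gb

print(f"bf16 weights:      {weights_gb:.0f} GB")
print(f"GPU budget at 0.7: {gpu_budget_gb:.0f} GB")
print(f"shortfall:         {shortfall_gb:.0f} GB")
```

Under these assumptions the bf16 weights alone are larger than the GPU memory the 0.7 setting allows, so the run depends entirely on CPU offload covering the difference.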

Open Platform for Enterprise AI org

Screenshot 2025-01-05 at 15.18.55.png
Still not working =))

Thank you for the information. I guess the issue might be related to the Marlin kernel that vLLM uses. Unfortunately, we don't have enough GPUs to test it ourselves. For now, you can try using this model with Transformers or on a CPU. If you're unable to reproduce the results, please let us know.

Open Platform for Enterprise AI org

Screenshot 2025-01-05 at 15.18.55.png
Still not working =))

Another workaround you could try is changing the code at https://huggingface.co/OPEA/DeepSeek-V3-int4-sym-gptq-inc/blob/main/modeling_deepseek.py#L389 to:

down_proj = self.down_proj((self.act_fn(self.gate_proj(x))/2.0) * (self.up_proj(x))/2.0)*4.0

We have validated this and achieved similar results in Transformers. You could also change the factors 2.0, 2.0, 4.0 to 4.0, 4.0, 16.0.
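
To illustrate why this rescaling can help, here is a toy scalar sketch, not the model code. It simulates FP16 rounding with Python's `struct` module and treats down_proj as a single hypothetical weight (`DOWN_WEIGHT`); in the real model it is a quantized matrix multiply. Because down_proj is linear, dividing its input by 4 and multiplying its output by 4 gives the same result mathematically, while keeping the intermediate product inside the FP16 range.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest FP16 value; return +/-inf if it does not fit."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


# Hypothetical values chosen to be exactly representable in FP16.
GATE_ACT = 512.0     # stands in for act_fn(gate_proj(x))
UP = 256.0           # stands in for up_proj(x)
DOWN_WEIGHT = 0.25   # stands in for down_proj (a scalar here)

# Original form: the product 512 * 256 = 131072 exceeds the FP16 maximum (65504).
hidden = to_fp16(GATE_ACT * UP)
naive = to_fp16(hidden * DOWN_WEIGHT)

# Rescaled form from the workaround: halve both factors, multiply by 4 afterwards.
hidden_scaled = to_fp16(to_fp16(GATE_ACT / 2.0) * to_fp16(UP / 2.0))
scaled = to_fp16(to_fp16(hidden_scaled * DOWN_WEIGHT) * 4.0)

print("naive :", naive)
print("scaled:", scaled)
```

In this toy case the naive path overflows to infinity, while the rescaled path recovers the exact value 131072 * 0.25 = 32768. The final multiplication can still overflow if the true output itself exceeds the FP16 range, so the factors only protect the intermediate product.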

Hi @cicdatopea, I tried to patch this .py file manually based on your suggestions, but unfortunately I'm still seeing similar errors:

{"id":"cmpl-2bd40cef6c534ff99023de26ebf5517e","object":"text_completion","created":1736459377,"model":".","choices":[{"index":0,"text":"íííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííí","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}
Open Platform for Enterprise AI org

Hi @abitsu, thank you for the information. Unfortunately, we don't have sufficient hardware to run this model on vLLM. You might consider trying the quantized models provided by other teams.

Got it, thanks anyway!

Same problem. I tried various vLLM configs. Unfortunately, none of them work.
It may be caused by the gptq_marlin quantization kernel.
