vllm
I launched vLLM on 8 x H100 SXM, using the following command to start the server:
vllm serve --model OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview --tensor-parallel-size 8 --max-model-len 16384 --max_num_seqs 1 --trust-remote-code
Then, I made a simple “Hello” request like this:
chat_completion = client.chat.completions.create(
    model="OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview",
    messages=[
        {
            "role": "user",
            "content": "Hello"
        },
    ],
    stream=True,
)
However, the output I received was quite strange, and I’m unsure why this is happening. Any insights into what might be causing this anomaly would be greatly appreciated.
AAAA隨後ffi®AAAAAAAAÄAAAAAAAAICAgICAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAflaaaaaaaaAAAAAh–
AAAAAAAAAAAAепflaaaaAAAA–
AAA mootÎ AAA AAAÎ巴克ÎiaisÎðStringÍABCDíð®AAA GuineaPokAAAAifiÎffffffialiíí @)]DotíÎ84ok安置куп Î HSÎÎ708414 followers Macrom Pembíð Including756HKiOS®157Indonesia Ronald HN仅有 ®ízi ÎIEEE
...
I will test it on the CPU side. Could you also test the verified prompts from our README?
As we do not have enough CUDA resources, we could not test it on the CUDA side.
Alright, I’ll test loading the model using Transformers, but it seems quite slow. How long did it take you to load this model with Transformers?
It takes over 20 minutes. What I suggest is testing the validated prompts in vLLM first. If the results aren't satisfactory, it might be better to try them in Transformers. If neither approach works, I suspect an overflow issue has occurred, similar to what we encountered with Qwen2.5-32B earlier: https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc. In that case we will try to quantize the model in a mixed-precision way.
Could you please explain how you managed to quantize this model without sufficient GPU resources?
Oh no, I waited for a long time and then got this error!
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 7 has a total capacity of 79.10 GiB of which 7.88 MiB is free. Process 222887 has 79.08 GiB memory in use. Of the allocated memory 56.03 GiB is allocated by PyTorch, and 22.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
We have not tested it on GPU; we only tested it on CPU.
I ran a quick test:
model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
model.layers.59.mlp.experts.138.down_proj tensor(1096.) tensor(130.5271)
model.layers.60.mlp.experts.81.down_proj tensor(6016.) tensor(8290.2236)
model.layers.60.mlp.experts.92.down_proj tensor(111616.) tensor(52362.3281)
model.layers.59.mlp.experts.138.down_proj tensor(1056.) tensor(125.2802)
model.layers.60.mlp.experts.81.down_proj tensor(5184.) tensor(7294.0933)
model.layers.60.mlp.experts.92.down_proj tensor(108032.) tensor(51036.6992)
model.layers.60.mlp.experts.81.down_proj tensor(4352.) tensor(6245.4785)
model.layers.60.mlp.experts.92.down_proj tensor(101888.) tensor(48230.1445)
model.layers.59.mlp.experts.138.down_proj tensor(1064.) tensor(124.9290)
model.layers.60.mlp.experts.81.down_proj tensor(5920.) tensor(8268.7275)
model.layers.60.mlp.experts.92.down_proj tensor(110592.) tensor(52426.2188)
model.layers.60.mlp.experts.81.down_proj tensor(5472.) tensor(7656.2739)
model.layers.60.mlp.experts.81.down_proj tensor(5472.) tensor(7656.2739)
model.layers.60.mlp.experts.92.down_proj tensor(107008.) tensor(50818.8711)
model.layers.60.mlp.experts.81.down_proj tensor(5760.) tensor(7966.4805)
model.layers.60.mlp.experts.92.down_proj tensor(107008.) tensor(51374.0078)
model.layers.60.mlp.experts.81.down_proj tensor(6688.) tensor(9049.8135)
model.layers.60.mlp.experts.92.down_proj tensor(117760.) tensor(55190.7734)
Maybe we need to exclude all the down_proj modules in the last layer from quantization.
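One way to read the quick-test output is to compare each printed value against the largest finite float16 value, 65504. The sketch below parses lines in the same `name tensor(a) tensor(b)` shape and reports modules where either value exceeds that limit. What the two tensors measure is not stated in the thread, so treating them as magnitudes relevant to float16 overflow is an assumption; the helper name `modules_over_fp16` is hypothetical.

```python
import re

FP16_MAX = 65504.0  # largest finite float16 value

# Matches lines such as:
#   model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
LINE_RE = re.compile(r"^(\S+)\s+tensor\(([\d.]+)\)\s+tensor\(([\d.]+)\)$")

def modules_over_fp16(log_text):
    """Return {module_name: largest_value} for modules with a value above FP16_MAX."""
    worst = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a stats line
        name = m.group(1)
        first, second = float(m.group(2)), float(m.group(3))
        worst[name] = max(worst.get(name, 0.0), first, second)
    return {name: value for name, value in worst.items() if value > FP16_MAX}

# A few lines copied from the output above.
sample = """\
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
"""
flagged = modules_over_fp16(sample)
print(flagged)
```

In the output posted above, model.layers.60.mlp.experts.92.down_proj is the only module whose printed values go past 65504; the others stay below it.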
Thank you so much! Could you please prioritize this? I’m eagerly anticipating the INT4 version of this model for deployment!
Working on it.
I get this error when starting vLLM. Can you help me?
ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-133701.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
Your command?
vllm serve --max-model-len 16384 --max_num_seqs 1 --trust_remote_code --tensor-parallel-size 4 OPEA/DeepSeek-V3-int4-sym-inc-cpu
@NikolaSigmoid we have added a workaround for the overflow issue. Please try the latest model. We have validated it with Transformers, but we do not have enough resources to validate it on vLLM.
@NikolaSigmoid I'm also working with vLLM. Maybe you could help me figure some things out :) I only have access to A100s: 9 nodes with 2 GPUs each (18 A100s of 80 GB, 1440 GB in total). I was trying to serve a bf16 version I found here on HF, but I am getting CUDA OOM, even though 685B params should be somewhere around 1350 GB plus some overhead. Any thoughts? I am also trying to offload to CPU, but that is not working either.
vllm serve opensourcerelease/DeepSeek-V3-bf16 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 5000 \
    --gpu-memory-utilization 0.7 \
    --cpu-offload-gb 540 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 9 \
    --trust-remote-code
any thoughts? pls help :(
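A rough budget for the setup described above may help explain the OOM. The sketch below uses only the numbers from the post (685B parameters, bf16, 18 GPUs of 80 GB, --gpu-memory-utilization 0.7). It assumes 2 bytes per parameter, an even split of weights across GPUs, and decimal gigabytes, and it ignores KV cache and activations, so it is a lower bound on what is needed rather than an exact figure.

```python
# Back-of-the-envelope memory budget; all inputs come from the post above.
PARAMS = 685e9            # parameter count quoted in the post
BYTES_PER_PARAM = 2       # bf16 stores each parameter in 2 bytes
NUM_GPUS = 18             # 9 nodes x 2 A100s
GPU_MEM_GB = 80           # per-GPU memory
GPU_MEM_UTIL = 0.7        # value passed to --gpu-memory-utilization

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9        # total weight size
total_gpu_gb = NUM_GPUS * GPU_MEM_GB               # raw GPU memory in the cluster
budget_gb = total_gpu_gb * GPU_MEM_UTIL            # what vLLM is allowed to use

weights_per_gpu_gb = weights_gb / NUM_GPUS         # assumes a perfectly even split
budget_per_gpu_gb = GPU_MEM_GB * GPU_MEM_UTIL

print(f"weights:        {weights_gb:.0f} GB")
print(f"usable on GPUs: {budget_gb:.0f} GB of {total_gpu_gb} GB")
print(f"per GPU:        need {weights_per_gpu_gb:.1f} GB, allowed {budget_per_gpu_gb:.1f} GB")
print(f"shortfall:      {weights_gb - budget_gb:.0f} GB before KV cache and activations")
```

Under these assumptions the weights alone come to about 1370 GB, while a 0.7 utilization cap leaves roughly 1008 GB of GPU memory, so the remainder has to come from CPU offload. As far as I understand the vLLM documentation, --cpu-offload-gb is specified per GPU, so 540 would request that much host memory for each of the two GPUs on a node; it is worth checking that each node actually has that much RAM available.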
Thank you for the information. I guess the issue might be related to the Marlin kernel they used. Unfortunately, we don’t have enough GPUs to test it ourselves. For now, you can try using this model with Transformers or on a CPU. If you’re unable to reproduce the results, please let us know.
Another workaround you could try is changing the code at https://huggingface.co/OPEA/DeepSeek-V3-int4-sym-gptq-inc/blob/main/modeling_deepseek.py#L389 to
down_proj = self.down_proj((self.act_fn(self.gate_proj(x))/2.0) * (self.up_proj(x))/2.0)*4.0
We have validated this and achieved similar results in Transformers. You could also change the factors 2.0, 2.0, 4.0 to 4.0, 4.0, 16.0.
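To illustrate why the rescaling above can help, here is a scalar toy model of the MLP in float16, using only the Python standard library. It is a sketch under assumptions, not the model code: `to_fp16`, `mlp_naive` and `mlp_scaled` are hypothetical helpers, a single multiplication stands in for the down_proj matrix product, and down_proj is assumed to have no bias so that multiplying its output by 4.0 undoes the two divisions by 2.0. The sketch divides each operand before multiplying, which is mathematically the same as the one-liner above but not the same evaluation order.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest float16; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def mlp_naive(gate_act: float, up: float, w_down: float) -> float:
    # hidden = act_fn(gate_proj(x)) * up_proj(x), stored in float16
    hidden = to_fp16(to_fp16(gate_act) * to_fp16(up))
    # down_proj modelled as multiplication by a single weight
    return to_fp16(hidden * w_down)

def mlp_scaled(gate_act: float, up: float, w_down: float, s: float = 2.0) -> float:
    # Shrink both operands first so their product stays inside the float16 range
    hidden = to_fp16(to_fp16(gate_act / s) * to_fp16(up / s))
    # Undo the scaling after the (linear, bias-free) down projection
    return to_fp16(to_fp16(hidden * w_down) * (s * s))

# 400 * 300 = 120000 is above the float16 maximum of 65504.
naive = mlp_naive(400.0, 300.0, 0.125)
scaled = mlp_scaled(400.0, 300.0, 0.125)
print(naive, scaled)
```

With these inputs the naive path overflows to inf at the hidden product, while the scaled path returns 15000.0, which matches 400 * 300 * 0.125 computed in full precision. Larger factors such as 4.0, 4.0, 16.0 give more headroom at the cost of pushing small values closer to underflow.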
Hi @cicdatopea, I tried to patch this .py file manually based on your suggestions, but unfortunately I'm still seeing similar errors:
{"id":"cmpl-2bd40cef6c534ff99023de26ebf5517e","object":"text_completion","created":1736459377,"model":".","choices":[{"index":0,"text":"íííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííííí","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}
Hi @abitsu, thank you for the information. Unfortunately, we don't have sufficient hardware to run this model on vLLM. You might consider trying the quantized models provided by other teams.
Got it, thanks anyway!
Same problem. I tried various vLLM configs; unfortunately, none of them work.
This may be caused by the gptq_marlin quantization kernel.