What's the difference between this and autoround_v0?
Hi! I'm confused about this repo and another one of your repos. What's the difference?
Also, can I use this repo on my NVIDIA card?
Sorry for the confusion. Based on our experience, this version is likely to perform slightly better than v0, as v0 was generated using auto-round-light tuning, while this one uses auto-round.
All models with the 'auto-round' suffix are compatible with the devices we support (CPU / Intel GPU / CUDA / HPU (limited support)), unless explicitly stated otherwise in the README.
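If it helps, on an NVIDIA card the model can usually be loaded through the standard Transformers path. This is only a minimal sketch; the model id below is a placeholder and auto-round must be installed:

# Minimal sketch: loading an AutoRound-quantized model on CUDA via Transformers.
# Assumes `pip install auto-round`; "OPEA/model-name-int4-AutoRound" is a placeholder id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/model-name-int4-AutoRound"  # placeholder, use the actual repo id
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # place weights on the available NVIDIA GPU(s)
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("What is AutoRound?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))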
Nice, thank you!
Hi! I want to serve this model with vLLM on my 4x RTX A6000 cards, but the README only gives an example for Transformers. Could you please add the vLLM serve launch parameters?
root@ubuntu:/data/vllminfer# CUDA_VISIBLE_DEVICES=4,5,6,7 uv run vllm serve /mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/ -tp=4 --enable-expert-parallel
INFO 07-28 18:30:40 [__init__.py:244] Automatically detected platform cuda.
INFO 07-28 18:30:43 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-28 18:30:43 [cli_args.py:325] non-default args: {'model': '/mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/', 'tensor_parallel_size': 4, 'enable_expert_parallel': True}
INFO 07-28 18:30:49 [config.py:841] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 07-28 18:30:49 [config.py:1472] Using max model len 262144
WARNING 07-28 18:30:49 [config.py:960] auto-round quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-28 18:30:49 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-28 18:30:54 [__init__.py:244] Automatically detected platform cuda.
INFO 07-28 18:30:56 [core.py:526] Waiting for init message from front-end.
INFO 07-28 18:30:56 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='/mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/', speculative_config=None, tokenizer='/mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-28 18:30:56 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 255 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-28 18:30:56 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_7f238b9f'), local_subscribe_addr='ipc:///tmp/e8fdc63c-f016-41d3-afae-ac90fa177bfb', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-28 18:31:00 [__init__.py:244] Automatically detected platform cuda.
INFO 07-28 18:31:00 [__init__.py:244] Automatically detected platform cuda.
INFO 07-28 18:31:00 [__init__.py:244] Automatically detected platform cuda.
INFO 07-28 18:31:00 [__init__.py:244] Automatically detected platform cuda.
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:03 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_acd44584'), local_subscribe_addr='ipc:///tmp/a8a04006-e5f7-4b6b-bb55-59f4275dc33f', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:03 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1fab8c6e'), local_subscribe_addr='ipc:///tmp/65afff10-9624-4f94-87f2-392c3c544e7c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:03 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ac5e5c24'), local_subscribe_addr='ipc:///tmp/f784d1cd-0c0c-4aab-ad0f-dffe337657bb', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:03 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6fb3a263'), local_subscribe_addr='ipc:///tmp/32ae12d6-e8c1-4b9e-8a00-22f357937420', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:04 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:04 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:04 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:04 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:04 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:04 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:04 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:04 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=1703196) WARNING 07-28 18:31:05 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=1703193) WARNING 07-28 18:31:05 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=2 pid=1703195) WARNING 07-28 18:31:05 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=1 pid=1703194) WARNING 07-28 18:31:05 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_c8b42821'), local_subscribe_addr='ipc:///tmp/cb6c92b8-e1ea-4274-94fa-a23c8325cee1', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:05 [parallel_state.py:1076] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:05 [parallel_state.py:1076] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:05 [parallel_state.py:1076] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:05 [parallel_state.py:1076] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=1 pid=1703194) WARNING 07-28 18:31:05 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=1703196) WARNING 07-28 18:31:05 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=1703195) WARNING 07-28 18:31:05 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=1703193) WARNING 07-28 18:31:05 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:05 [gpu_model_runner.py:1770] Starting to load model /mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/...
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:05 [gpu_model_runner.py:1770] Starting to load model /mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/...
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:05 [gpu_model_runner.py:1770] Starting to load model /mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/...
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:05 [gpu_model_runner.py:1770] Starting to load model /mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/...
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:05 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:05 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:05 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker rank=0 pid=1703193) INFO 07-28 18:31:05 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1703193) WARNING 07-28 18:31:05 [config.py:440] MoE DP setup unable to determine quantization scheme or unsupported quantization type. This model will not run with DP enabled.
(VllmWorker rank=2 pid=1703195) INFO 07-28 18:31:05 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=1703195) WARNING 07-28 18:31:05 [config.py:440] MoE DP setup unable to determine quantization scheme or unsupported quantization type. This model will not run with DP enabled.
(VllmWorker rank=1 pid=1703194) INFO 07-28 18:31:05 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=1703194) WARNING 07-28 18:31:05 [config.py:440] MoE DP setup unable to determine quantization scheme or unsupported quantization type. This model will not run with DP enabled.
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:05 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker rank=3 pid=1703196) INFO 07-28 18:31:05 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=1703196) WARNING 07-28 18:31:05 [config.py:440] MoE DP setup unable to determine quantization scheme or unsupported quantization type. This model will not run with DP enabled.
Loading safetensors checkpoint shards: 0% Completed | 0/27 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.worker.load_model()
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.model_runner.load_model()
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.model = model_loader.load_model(
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.load_weights(model, model_config)
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] loaded_weights = model.load_weights(
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 541, in load_weights
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] return loader.load_weights(weights)
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] yield from self._load_module(prefix,
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] loaded_params = module_load_weights(weights)
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 475, in load_weights
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] param = params_dict[name]
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ~~~~~~~~~~~^^^^^^
(VllmWorker rank=0 pid=1703193) ERROR 07-28 18:31:06 [multiproc_executor.py:487] KeyError: 'layers.0.mlp.gate.qweight'
Loading safetensors checkpoint shards: 0% Completed | 0/27 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=1703193)
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.worker.load_model()
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.model_runner.load_model()
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.model = model_loader.load_model(
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] self.load_weights(model, model_config)
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] loaded_weights = model.load_weights(
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 541, in load_weights
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] return loader.load_weights(weights)
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] yield from self._load_module(prefix,
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] loaded_params = module_load_weights(weights)
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 475, in load_weights
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] param = params_dict[name]
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] ~~~~~~~~~~~^^^^^^
(VllmWorker rank=2 pid=1703195) ERROR 07-28 18:31:06 [multiproc_executor.py:487] KeyError: 'layers.0.mlp.gate.qweight'
[rank0]:[W728 18:31:07.855164156 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]:[W728 18:31:08.756202446 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=102, addr=[localhost]:43744, remote=[localhost]:58769): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x98 (0x7f821cf785e8 in /data/vllminfer/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5ba8afe (0x7f8205e5aafe in /data/vllminfer/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5baae40 (0x7f8205e5ce40 in /data/vllminfer/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5bab74a (0x7f8205e5d74a in /data/vllminfer/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x2a9 (0x7f8205e571a9 in /data/vllminfer/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7f81c30509a9 in /data/vllminfer/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc253 (0x7f81b2fd8253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94ac3 (0x7f821de62ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x126850 (0x7f821def4850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W728 18:31:08.762847629 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
ERROR 07-28 18:31:10 [core.py:586] EngineCore failed to start.
ERROR 07-28 18:31:10 [core.py:586] Traceback (most recent call last):
ERROR 07-28 18:31:10 [core.py:586] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
ERROR 07-28 18:31:10 [core.py:586] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-28 18:31:10 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-28 18:31:10 [core.py:586] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 404, in __init__
ERROR 07-28 18:31:10 [core.py:586] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-28 18:31:10 [core.py:586] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 75, in __init__
ERROR 07-28 18:31:10 [core.py:586] self.model_executor = executor_class(vllm_config)
ERROR 07-28 18:31:10 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-28 18:31:10 [core.py:586] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-28 18:31:10 [core.py:586] self._init_executor()
ERROR 07-28 18:31:10 [core.py:586] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
ERROR 07-28 18:31:10 [core.py:586] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-28 18:31:10 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-28 18:31:10 [core.py:586] File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
ERROR 07-28 18:31:10 [core.py:586] raise e from None
ERROR 07-28 18:31:10 [core.py:586] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/data/python/python3.12.9/install/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/data/python/python3.12.9/install/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
raise e
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 404, in init
super().init(vllm_config, executor_class, log_stats,
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 75, in init
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in init
self._init_executor()
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "/data/vllminfer/.venv/bin/vllm", line 10, in
sys.exit(main())
^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
args.dispatch_function(args)
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
uvloop.run(run_server(args))
File "/data/vllminfer/.venv/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/data/python/python3.12.9/install/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/data/python/python3.12.9/install/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/data/vllminfer/.venv/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/python/python3.12.9/install/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/python/python3.12.9/install/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
return cls(
^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 124, in init
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 666, in init
super().init(
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 403, in init
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/python/python3.12.9/install/lib/python3.12/contextlib.py", line 144, in exit
next(self.gen)
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
wait_for_engine_startup(
File "/data/vllminfer/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/data/python/python3.12.9/install/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I'm using vLLM 0.9.2. Let me update my vLLM version to 0.10.0 and try again later.
This is a known kernel issue with vLLM's support for MoE models that we have been aware of for a long time. We are testing the latest version of vLLM to see what we can do.
https://github.com/vllm-project/vllm/pull/17850
Known issues
Mixed-bit support is limited
Mixed-bit quantization is currently limited. Since vLLM fuses layers (e.g., QKV), applying different bit-widths to components within the same fused layer can lead to incompatibility issues (see the sketch after this list).
Quantized MoE model support is limited
Qwen3-30B-A3B: KeyError: 'layers.45.mlp.gate.qweight'; the GPTQ format has the same issue, while AWQ reports assert self.quant_method is not None.
deepseek-moe-16b-base: "The input size is not aligned with the quantized weight shape", or "MergedColumnParallelLinear object has no attribute 'weight'". The same issues exist for AWQ and GPTQ.
Quantized VLM support is limited
The module names may differ from those in Transformers, which introduces the risk of failing to parse the quantization config correctly.
OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc: the Marlin kernel has issues; we need to fall back to the GPTQ kernel.
Qwen2.5-VL-7B: the auto_round:auto_gptq format fails with both the Marlin and GPTQ kernels, and the GPTQ model has a similar issue. The auto_round:auto_awq and AWQ formats are fine.
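As a rough illustration of the mixed-bit limitation above, one way to avoid it is to give every projection that vLLM fuses the same bit-width when building the AutoRound layer_config. This is only a sketch; build_uniform_layer_config is a hypothetical helper, not an AutoRound API, and the bit-width/group-size defaults are illustrative:

# Sketch: assign one uniform bit-width to all projections vLLM fuses together
# (q/k/v into a single QKV linear, gate/up into a merged column-parallel linear),
# so no fused layer ends up mixing bit-widths.
def build_uniform_layer_config(model, bits=4, group_size=128):
    fused_keys = ("q_proj", "k_proj", "v_proj", "gate_proj", "up_proj")
    layer_config = {}
    for name, _ in model.named_modules():
        if any(key in name for key in fused_keys):
            layer_config[name] = {"bits": bits, "group_size": group_size}
    return layer_config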
All right, thank you for your hard work, but it seems vLLM 0.10 still leaves this bug unfixed...
(VllmWorker rank=4 pid=1709180) ERROR 07-28 18:46:46 [multiproc_executor.py:511] KeyError: 'layers.0.mlp.gate.qweight'
After checking the code and running some experiments, a simple workaround is to fall back the "mlp.gate" layers to 16-bit. We will regenerate the model and upload it later.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "/dataset/Qwen3-235B-A22B-Instruct-2507-test"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Keep every MoE routing gate ("mlp.gate") in 16-bit so vLLM finds the weight names it expects
layer_config = {}
for n, m in model.named_modules():
    print(n, m.__class__.__name__)
    if "mlp.gate" in n:
        layer_config[n] = {"bits": 16}

autoround = AutoRound(model, tokenizer, iters=0, group_size=64, layer_config=layer_config)
autoround.quantize_and_save("/data5/wenhuach/Qwen3-235B-A22B-Instruct-2507-test-1")
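Once the regenerated checkpoint loads in vLLM, the server exposes the usual OpenAI-compatible endpoint. A minimal sketch of querying it, assuming the default port 8000 and that the served model name is the path passed to vllm serve:

# Minimal sketch: querying the vLLM OpenAI-compatible server after it starts.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server, no real key needed
resp = client.chat.completions.create(
    # By default the served model name is the path given to `vllm serve`
    model="/mnt/data/models/Qwen3-235B-A22B-Thinking-2507-int4-mixed-AutoRound/",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)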