When I run this model on a single GPU, I get an error saying VRAM is insufficient. When I instead use multiple GPUs on one machine with tensor parallelism, the server crashes with the errors below. My vLLM version is 0.8.4.
(vllm_p) (base) ktkj@ktkj-ThinkStation-PX:~/uv_project/vllm_p$ vllm serve "/home/ktkj/origin_models/THUDM/GLM-4-32B-0414" \
    --tensor-parallel-size 2 \
    --dtype bfloat16
INFO 04-18 14:32:38 [__init__.py:239] Automatically detected platform cuda.
INFO 04-18 14:32:39 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-18 14:32:39 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/ktkj/origin_models/THUDM/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/ktkj/origin_models/THUDM/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7cf100ec5ee0>)
INFO 04-18 14:32:44 [config.py:689] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 04-18 14:32:44 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-18 14:32:44 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-18 14:32:48 [__init__.py:239] Automatically detected platform cuda.
INFO 04-18 14:32:50 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='/home/ktkj/origin_models/THUDM/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/ktkj/origin_models/THUDM/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/ktkj/origin_models/THUDM/GLM-4-32B-0414, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-18 14:32:50 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-18 14:32:50 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_52585278'), local_subscribe_addr='ipc:///tmp/d90a47e1-6f13-4498-be1c-b59a5d9b26cb', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-18 14:32:52 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-18 14:32:54 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x73d45b454110>
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:54 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_03cf8fca'), local_subscribe_addr='ipc:///tmp/8a96c300-8e47-44ab-832f-cd8172ab83e1', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-18 14:32:57 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-18 14:32:59 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ab3532ce120>
(VllmWorker rank=1 pid=127397) INFO 04-18 14:32:59 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7e095453'), local_subscribe_addr='ipc:///tmp/54edae1b-eb09-498c-8936-5c51b88d7af3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=127397) INFO 04-18 14:32:59 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:59 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=127397) INFO 04-18 14:32:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:59 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ktkj/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=1 pid=127397) INFO 04-18 14:32:59 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ktkj/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=1 pid=127397) WARNING 04-18 14:32:59 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=127317) WARNING 04-18 14:32:59 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:59 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_17577620'), local_subscribe_addr='ipc:///tmp/6aa91493-c82a-42ba-b955-f3c4e0b60fce', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:59 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=127317) INFO 04-18 14:32:59 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=127397) INFO 04-18 14:32:59 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=127397) INFO 04-18 14:32:59 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=127317) INFO 04-18 14:33:00 [gpu_model_runner.py:1276] Starting to load model /home/ktkj/origin_models/THUDM/GLM-4-32B-0414...
(VllmWorker rank=1 pid=127397) INFO 04-18 14:33:00 [gpu_model_runner.py:1276] Starting to load model /home/ktkj/origin_models/THUDM/GLM-4-32B-0414...
(VllmWorker rank=0 pid=127317) WARNING 04-18 14:33:00 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=127397) WARNING 04-18 14:33:00 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:09, 1.38it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:01<00:09, 1.32it/s]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:02<00:08, 1.32it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:03<00:07, 1.33it/s]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:03<00:06, 1.31it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:04<00:06, 1.30it/s]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:05<00:05, 1.34it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:05<00:03, 1.64it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:06<00:03, 1.55it/s]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:07<00:02, 1.47it/s]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:07<00:01, 1.51it/s]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:08<00:01, 1.48it/s]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:09<00:00, 1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:09<00:00, 1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:09<00:00, 1.41it/s]
(VllmWorker rank=0 pid=127317)
(VllmWorker rank=1 pid=127397) INFO 04-18 14:33:10 [loader.py:458] Loading weights took 10.00 seconds
(VllmWorker rank=0 pid=127317) INFO 04-18 14:33:10 [loader.py:458] Loading weights took 10.00 seconds
(VllmWorker rank=1 pid=127397) INFO 04-18 14:33:10 [gpu_model_runner.py:1291] Model loading took 30.4522 GiB and 10.174721 seconds
(VllmWorker rank=0 pid=127317) INFO 04-18 14:33:10 [gpu_model_runner.py:1291] Model loading took 30.4522 GiB and 10.175404 seconds
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] WorkerProc hit an exception: %s
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] Traceback (most recent call last):
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2586, in run_node
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return node.target(*args, **kwargs)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] The above exception was the direct cause of the following exception:
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] Traceback (most recent call last):
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2471, in get_fake_value
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ret_val = wrap_fake_exception(
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2017, in wrap_fake_exception
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return fn()
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2472, in <lambda>
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] lambda: run_node(tx.output, node, args, kwargs, nnmodule)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2604, in run_node
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] raise RuntimeError(make_error_message(e)).with_traceback(
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2586, in run_node
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return node.target(*args, **kwargs)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] RuntimeError: Failed running call_function (((FakeTensor(..., device='cuda:1', size=(s0, 6144), dtype=torch.bfloat16), FakeTensor(..., device='cuda:1', size=(s0, 6144), dtype=torch.bfloat16)), Parameter(FakeTensor(..., device='cuda:1', size=(23040, 6144), dtype=torch.bfloat16)), None), **{}):
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] linear(): argument 'input' (position 1) must be Tensor, not tuple
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] During handling of the above exception, another exception occurred:
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] Traceback (most recent call last):
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 375, in worker_busy_loop
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] output = func(*args, **kwargs)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return func(*args, **kwargs)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] self.model_runner.profile_run()
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] self.call_function(fn, args, kwargs)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 897, in call_function
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/variables/nn_module.py", line 914, in call_function
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return variables.UserFunctionVariable(fn, source=source).call_function(
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py", line 317, in call_function
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return super().call_function(tx, args, kwargs)
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=127397) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2534, in get_fake_value
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] unimplemented(f"TypeError {node.target}: {cause}")
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 317, in unimplemented
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] raise Unsupported(msg, case_name=case_name)
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] torch._dynamo.exc.Unsupported: TypeError : linear(): argument 'input' (position 1) must be Tensor, not tuple
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] from user code:
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] hidden_states = self.mlp(hidden_states)
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] x, _ = self.gate_up_proj(x)
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] output_parallel = self.quant_method.apply(self, input, bias)
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] return F.linear(x, layer.weight, bias)
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] You can suppress this exception and fall back to eager by setting:
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] import torch._dynamo
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380] torch._dynamo.config.suppress_errors = True
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
(VllmWorker rank=0 pid=127317) ERROR 04-18 14:33:10 [multiproc_executor.py:380]
ERROR 04-18 14:33:10 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-18 14:33:10 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-18 14:33:10 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-18 14:33:10 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in init
ERROR 04-18 14:33:10 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 04-18 14:33:10 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-18 14:33:10 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-18 14:33:10 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 04-18 14:33:10 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-18 14:33:10 [core.py:387] raise e
ERROR 04-18 14:33:10 [core.py:387] File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-18 14:33:10 [core.py:387] raise RuntimeError(
ERROR 04-18 14:33:10 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'TypeError : linear(): argument 'input' (position 1) must be Tensor, not tuple\n\nfrom user code:\n File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 360, in forward\n hidden_states, residual = layer(positions, hidden_states, residual)\n File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward\n hidden_states = self.mlp(hidden_states)\n File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 92, in forward\n x, _ = self.gate_up_proj(x)\n File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward\n output_parallel = self.quant_method.apply(self, input, bias)\n File "/home/ktkj/uv_project/vllm_p/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply\n return F.linear(x, layer.weight, bias)\n\nSet TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n import torch._dynamo\n torch._dynamo.config.suppress_errors = True\n')
ERROR 04-18 14:33:10 [core.py:387]
CRITICAL 04-18 14:33:10 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
You should install vLLM directly from the latest source code, i.e. update to the latest main branch of vLLM; a PR fixing this issue was merged yesterday.
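For reference, here is a rough sketch of a source install, assuming the fix has already landed on the main branch of the official vllm-project/vllm repository (build details can vary with your CUDA/PyTorch setup, so check the vLLM installation docs for your environment):

# Get the latest vLLM source (assumes the fix is merged into main)
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Build and install into the current virtual environment;
# a full source build can take a long time, see the vLLM docs for faster options
pip install -e .
# Then retry the original command
vllm serve "/home/ktkj/origin_models/THUDM/GLM-4-32B-0414" \
    --tensor-parallel-size 2 \
    --dtype bfloat16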