W8A8-fp8 version

#12
by OveJie - opened

I tried to produce a W8A8-FP8 version of this model with llmcompressor, following the recipe used for RedHatAI/gemma-3-27b-it-FP8-dynamic, but for some reason the resulting weight files are broken and cannot be loaded in vLLM v0.9.2.
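For context, a minimal sketch of the kind of llmcompressor FP8-dynamic (W8A8-fp8) oneshot run described above looks roughly like this; the model path, save directory, and ignore list are my assumptions, not the exact script I ran:

```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "/models/gemma-3-27b-it-abliterated"   # placeholder local path
SAVE_DIR = MODEL_ID + "-W8A8-fp8"

model = Gemma3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 dynamic ("W8A8-fp8"): FP8 weights with per-channel scales, activations
# quantized dynamically per token, so no calibration data is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head"],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```

With an ignore list like this, every matched Linear layer gets FP8 weight scales, including the SigLIP vision tower, which appears to be what vLLM trips over below.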

This bug is supposed to be fixed in vLLM v0.9.2 already, yet loading still fails with the error below:

INFO 07-17 05:42:13 [gpu_model_runner.py:1770] Starting to load model /models/gemma-3-27b-it-abliterated-W8A8-fp8...
INFO 07-17 05:42:14 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-17 05:42:14 [cuda.py:287] Using FlexAttention backend on V1 engine.
INFO 07-17 05:42:14 [cuda.py:284] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/6 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  17% Completed | 1/6 [00:00<00:03,  1.48it/s]
Loading safetensors checkpoint shards:  33% Completed | 2/6 [00:01<00:02,  1.39it/s]
ERROR 07-17 05:42:17 [core.py:586] EngineCore failed to start.
ERROR 07-17 05:42:17 [core.py:586] Traceback (most recent call last):
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
ERROR 07-17 05:42:17 [core.py:586]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-17 05:42:17 [core.py:586]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
ERROR 07-17 05:42:17 [core.py:586]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
ERROR 07-17 05:42:17 [core.py:586]     self.model_executor = executor_class(vllm_config)
ERROR 07-17 05:42:17 [core.py:586]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Process EngineCore_0:
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-17 05:42:17 [core.py:586]     self._init_executor()
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-17 05:42:17 [core.py:586]     self.collective_rpc("load_model")
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-17 05:42:17 [core.py:586]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-17 05:42:17 [core.py:586]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-17 05:42:17 [core.py:586]     return func(*args, **kwargs)
ERROR 07-17 05:42:17 [core.py:586]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
ERROR 07-17 05:42:17 [core.py:586]     self.model_runner.load_model()
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
ERROR 07-17 05:42:17 [core.py:586]     self.model = model_loader.load_model(
ERROR 07-17 05:42:17 [core.py:586]                  ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
ERROR 07-17 05:42:17 [core.py:586]     self.load_weights(model, model_config)
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
ERROR 07-17 05:42:17 [core.py:586]     loaded_weights = model.load_weights(
ERROR 07-17 05:42:17 [core.py:586]                      ^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 720, in load_weights
ERROR 07-17 05:42:17 [core.py:586]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
ERROR 07-17 05:42:17 [core.py:586]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
ERROR 07-17 05:42:17 [core.py:586]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 07-17 05:42:17 [core.py:586]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
ERROR 07-17 05:42:17 [core.py:586]     yield from self._load_module(prefix,
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
ERROR 07-17 05:42:17 [core.py:586]     loaded_params = module_load_weights(weights)
ERROR 07-17 05:42:17 [core.py:586]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-17 05:42:17 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 519, in load_weights
ERROR 07-17 05:42:17 [core.py:586]     param = params_dict[name]
ERROR 07-17 05:42:17 [core.py:586]             ~~~~~~~~~~~^^^^^^
ERROR 07-17 05:42:17 [core.py:586] KeyError: 'vision_model.encoder.layers.0.mlp.fc1.weight_scale'

https://github.com/vllm-project/llm-compressor/issues/1306#issuecomment-2942874690
Following the suggestion there, the W8A8-FP8 model can run, but the entire model still cannot be used.
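As far as I can tell, the workaround discussed there amounts to keeping the vision parts out of the quantization recipe. A sketch of that change (the exact regex patterns are assumptions and depend on this checkpoint's module names):

```python
# Keep the SigLIP vision tower and the multi-modal projector in the original
# precision, so the checkpoint contains no weight_scale tensors that vLLM's
# vision-model loader cannot map.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head",
        "re:vision_tower.*",
        "re:multi_modal_projector.*",
    ],
)
```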

OveJie changed discussion status to closed
OveJie changed discussion status to open
