simon-mo committed on
Commit 11b897d · verified · 1 Parent(s): 57c8978

Remove vLLM FP8 Limitation


This has been fixed as of the latest v0.8.5 release 🙇
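For reference, with vLLM v0.8.5 or newer the block-quantized FP8 checkpoint loads without the `linear.py` workaround removed below. A minimal offline-inference sketch, assuming the `Qwen/Qwen3-235B-A22B-FP8` model ID and a tensor-parallel size of 4 (both illustrative, adjust to your setup):

```python
# Sketch only: loading the FP8 checkpoint on vLLM >= 0.8.5,
# no changes to vllm/model_executor/layers/linear.py required.
# Model ID and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-FP8",  # FP8 block quantization is picked up from the checkpoint config
    tensor_parallel_size=4,            # adjust to the number of available GPUs
)

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```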

Files changed (1)
  1. README.md +0 -23
README.md CHANGED
@@ -114,29 +114,6 @@ You can use the Qwen3-235B-A22B-FP8 model with serveral inference frameworks, in
 However, please pay attention to the following known issues:
 - `transformers`:
     - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
-- vLLM:
-    - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-      ```python
-      # these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-      ...
-      shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-      shard_size = self._get_shard_size_mapping(loaded_shard_id)
-
-      # add the following code
-      if isinstance(param, BlockQuantScaleParameter):
-          weight_block_size = self.quant_method.quant_config.weight_block_size
-          block_n, _ = weight_block_size[0], weight_block_size[1]
-          shard_offset = (shard_offset + block_n - 1) // block_n
-          shard_size = (shard_size + block_n - 1) // block_n
-      # end of the modification
-
-      param.load_qkv_weight(loaded_weight=loaded_weight,
-                            num_heads=self.num_kv_head_replicas,
-                            shard_id=loaded_shard_id,
-                            shard_offset=shard_offset,
-                            shard_size=shard_size)
-      ...
-      ```
 
 ## Switching Between Thinking and Non-Thinking Mode
 