Qwen
/

Qwen3-4B-FP8

@@ -95,7 +95,7 @@ print("thinking content:", thinking_content)
 print("content:", content)
 ```
-For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` or  to create an OpenAI-compatible API endpoint:
 - SGLang:
     ```shell
     python -m sglang.launch_server --model-path Qwen/Qwen3-4B-FP8 --reasoning-parser qwen3
@@ -105,39 +105,16 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` or  to create
     vllm serve Qwen/Qwen3-4B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
     ```
-For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM have also supported Qwen3.
 ## Note on FP8
 For convenience and performance, we have provided `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with block size of 128. You can find more details in the `quantization_config` field in `config.json`.
-You can use the Qwen3-4B-FP8 model with serveral inference frameworks, including `transformers`, `vllm`, and `sglang`, as the original bfloat16 model.
 However, please pay attention to the following known issues:
 - `transformers`:
     - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
-- vLLM:
-    - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-        ```python
-        # these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-        ...
-        shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-        shard_size = self._get_shard_size_mapping(loaded_shard_id)
-        # add the following code
-        if isinstance(param, BlockQuantScaleParameter):
-            weight_block_size = self.quant_method.quant_config.weight_block_size
-            block_n, _ = weight_block_size[0], weight_block_size[1]
-            shard_offset = (shard_offset + block_n - 1) // block_n
-            shard_size = (shard_size + block_n - 1) // block_n
-        # end of the modification
-        param.load_qkv_weight(loaded_weight=loaded_weight,
-                                num_heads=self.num_kv_head_replicas,
-                                shard_id=loaded_shard_id,
-                                shard_offset=shard_offset,
-                                shard_size=shard_size)
-        ...
-        ```
 ## Switching Between Thinking and Non-Thinking Mode
@@ -311,7 +288,7 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers
     {
         ...,
         "rope_scaling": {
-            "type": "yarn",
             "factor": 4.0,
             "original_max_position_embeddings": 32768
         }
@@ -323,12 +300,12 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers
   For `vllm`, you can use
     ```shell
-    vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
     ```
   For `sglang`, you can use
     ```shell
-    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
     ```
   For `llama-server` from `llama.cpp`, you can use

 print("content:", content)
 ```
+For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` or to create an OpenAI-compatible API endpoint:
 - SGLang:
     ```shell
     python -m sglang.launch_server --model-path Qwen/Qwen3-4B-FP8 --reasoning-parser qwen3
     vllm serve Qwen/Qwen3-4B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
     ```
+For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.
 ## Note on FP8
 For convenience and performance, we have provided `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with block size of 128. You can find more details in the `quantization_config` field in `config.json`.
+You can use the Qwen3-32B-FP8 model with serveral inference frameworks, including `transformers`, `sglang`, and `vllm`, as the original bfloat16 model.
 However, please pay attention to the following known issues:
 - `transformers`:
     - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
 ## Switching Between Thinking and Non-Thinking Mode
     {
         ...,
         "rope_scaling": {
+            "rope_type": "yarn",
             "factor": 4.0,
             "original_max_position_embeddings": 32768
         }
   For `vllm`, you can use
     ```shell
+    vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
     ```
   For `sglang`, you can use
     ```shell
+    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
     ```
   For `llama-server` from `llama.cpp`, you can use