JunHowie committed on
Commit
293e121
·
verified ·
1 Parent(s): 50520c6

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.mdl ADDED
Binary file (55 Bytes). View file
 
.msc ADDED
Binary file (1.89 kB). View file
 
.mv ADDED
@@ -0,0 +1 @@
1
+ Revision:master,CreatedAt:1753851721
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - glm4_moe
+ - AWQ
+ - FP16Mix
+ - 量化修复
+ - vLLM
+ base_model:
+ - ZhipuAI/GLM-4.5-Air
+ base_model_relation: quantized
+ ---
+ # GLM-4.5-Air-AWQ-FP16Mix
+ Base model: [ZhipuAI/GLM-4.5-Air](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5-Air)
+ 
+ 
+ ### 【vLLM single-node 8-GPU launch command】
+ <i>Note: this model must be launched with `--enable-expert-parallel`, otherwise the expert tensors cannot be split evenly under tensor parallelism; the flag is required even with only 2 GPUs.</i>
+ ```
+ CONTEXT_LENGTH=32768
+ 
+ vllm serve \
+   tclf90/GLM-4.5-Air-AWQ-FP16Mix \
+   --served-model-name GLM-4.5-Air-AWQ-FP16Mix \
+   --enable-expert-parallel \
+   --swap-space 16 \
+   --max-num-seqs 512 \
+   --max-model-len $CONTEXT_LENGTH \
+   --max-seq-len-to-capture $CONTEXT_LENGTH \
+   --gpu-memory-utilization 0.9 \
+   --tensor-parallel-size 8 \
+   --trust-remote-code \
+   --disable-log-requests \
+   --host 0.0.0.0 \
+   --port 8000
+ ```
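Once the server is up it exposes an OpenAI-compatible API. A minimal usage sketch (assumptions: the server started above is reachable at `http://localhost:8000/v1`, no API key is enforced, and the `openai` Python client is installed; this snippet is illustrative, not part of the upload):

```python
# Query the vLLM OpenAI-compatible endpoint started by the command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="GLM-4.5-Air-AWQ-FP16Mix",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```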
+ 
+ ### 【Dependencies】
+ 
+ ```
+ vllm==0.10.0
+ ```
+ 
+ ### 【❗❗Temporary patch for vllm==0.10.0❗❗】
+ When loading AWQ MoE models, `awq_marlin` in `vllm` skips the check of the `modules_to_not_convert` parameter, so the mixed-precision quantization of the MoE layers either does not take effect or raises an error [[PR #21888]](https://github.com/vllm-project/vllm/pull/21888).
+ 
+ Until that PR is merged, temporarily replace `vllm/model_executor/layers/quantization/awq_marlin.py` with the `awq_marlin.py` shipped in this repository.
+ 
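A small locating sketch (illustrative only; the actual copy is left commented out so the paths can be reviewed first, and `awq_marlin.py` here refers to the file uploaded in this repo):

```python
# Find the awq_marlin.py that the installed vLLM actually imports, so the patched
# file from this repo can be copied over it.
import shutil
import vllm.model_executor.layers.quantization.awq_marlin as awq_marlin

target = awq_marlin.__file__
print("installed awq_marlin.py:", target)
# shutil.copyfile("awq_marlin.py", target)  # uncomment after reviewing both files
```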
+ ### 【Model update date】
+ ```
+ 2025-07-30
+ 1. Initial commit
+ ```
+ 
+ ### 【Model list】
+ 
+ | File size | Last updated |
+ |--------|--------------|
+ | `69GB` | `2025-07-30` |
+ 
+ 
+ 
+ ### 【Model download】
+ 
+ ```python
+ from modelscope import snapshot_download
+ snapshot_download('tclf90/GLM-4.5-Air-AWQ-FP16Mix', cache_dir="your/local/path")
+ ```
+ 
+ 
+ ### 【Introduction】
+ 
+ # GLM-4.5
+ 
+ <div align="center">
+ <img src=https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg width="15%"/>
+ </div>
+ <p align="center">
+ 👋 Join our <a href="https://github.com/zai-org/GLM-4.5/blob/main/resources/WECHAT.md" target="_blank">WeChat group</a>.
+ <br>
+ 📖 Read the GLM-4.5 <a href="https://z.ai/blog/glm-4.5" target="_blank">technical blog</a>.
+ <br>
+ 📍 Use the GLM-4.5 API on the <a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-4.5">Zhipu AI open platform</a>.
+ <br>
+ 👉 Try <a href="https://chat.z.ai">GLM-4.5</a> with one click.
+ </p>
+ 
+ ## Model introduction
+ 
+ The **GLM-4.5** family consists of foundation models designed for agent applications. GLM-4.5 has **355** billion total parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion total parameters and **12** billion active parameters. Both models unify reasoning, coding, and agentic capabilities to meet the complex demands of agent applications.
+ 
+ GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that offer two modes: a thinking mode for complex reasoning and tool use, and a non-thinking mode for instant responses.
+ 
+ We have open-sourced the base models, the hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT license and may be used commercially and for secondary development.
+ 
+ In our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 performs strongly with a score of **63.2**, ranking **3rd** among all proprietary and open-source models. Notably, GLM-4.5-Air still achieves a competitive **59.8** while maintaining excellent efficiency.
+ 
+ ![bench](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/bench.png)
+ 
+ For more evaluation results, showcase examples, and technical details, please visit our [technical blog](https://z.ai/blog/glm-4.5). The technical report will be released soon.
+ 
+ The model code, tool parser, and reasoning parser can be found in the [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py), and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py) implementations.
+ 
+ ## Quick start
+ 
+ Please refer to our [GitHub](https://github.com/zai-org/GLM-4.5) project.
awq_marlin.py ADDED
@@ -0,0 +1,537 @@
1
+ # SPDX-License-Identifier: Apache-2.0
2
+ # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
3
+
4
+ from copy import deepcopy
5
+ from typing import Any, Callable, Optional
6
+
7
+ import torch
8
+ from torch.nn import Parameter
9
+
10
+ import vllm.model_executor.layers.fused_moe # noqa
11
+ from vllm import _custom_ops as ops
12
+ from vllm.logger import init_logger
13
+ from vllm.model_executor.layers.fused_moe.layer import (
14
+ FusedMoE, FusedMoEMethodBase, FusedMoeWeightScaleSupported,
15
+ UnquantizedFusedMoEMethod)
16
+ from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
17
+ UnquantizedLinearMethod,
18
+ set_weight_attrs)
19
+ from vllm.model_executor.layers.quantization import QuantizationMethods
20
+ from vllm.model_executor.layers.quantization.awq import (AWQConfig,
21
+ is_layer_skipped_awq)
22
+ from vllm.model_executor.layers.quantization.base_config import (
23
+ QuantizationConfig, QuantizeMethodBase)
24
+ from vllm.model_executor.layers.quantization.utils import replace_parameter
25
+ from vllm.model_executor.layers.quantization.utils.marlin_utils import (
26
+ apply_awq_marlin_linear, awq_to_marlin_zero_points, check_marlin_supported,
27
+ check_marlin_supports_layer, check_moe_marlin_supports_layer,
28
+ marlin_make_empty_g_idx, marlin_make_workspace_new,
29
+ marlin_moe_permute_scales, marlin_permute_scales,
30
+ moe_awq_to_marlin_zero_points, verify_marlin_supported,
31
+ verify_marlin_supports_shape)
32
+ from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
33
+ from vllm.model_executor.parameter import (GroupQuantScaleParameter,
34
+ PackedvLLMParameter)
35
+ from vllm.platforms import current_platform
36
+ from vllm.scalar_type import scalar_types
37
+
38
+ logger = init_logger(__name__)
39
+
40
+
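+ # Note: this is the patched entry point for this repo's FP16Mix checkpoints. For a
+ # FusedMoE layer whose prefix matches `modules_to_not_convert`, the unquantized MoE
+ # method is returned so the FP16 expert weights load instead of the AWQ Marlin path.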
41
+ def get_moe_quant_method(
42
+ config: QuantizationConfig,
43
+ layer: torch.nn.Module,
44
+ prefix: str,
45
+ moe_method_cls: type,
46
+ ):
47
+ if isinstance(layer, FusedMoE) and is_layer_skipped_awq(prefix, getattr(config, "modules_to_not_convert", [])):
48
+ return UnquantizedFusedMoEMethod(layer.moe_config)
49
+ return moe_method_cls(config)
50
+
51
+
52
+ class AWQMarlinConfig(QuantizationConfig):
53
+ """Config class for AWQ Marlin"""
54
+
55
+ # num_bits -> type
56
+ TYPE_MAP = {
57
+ 4: scalar_types.uint4,
58
+ 8: scalar_types.uint8,
59
+ }
60
+
61
+ def __init__(self, weight_bits: int, group_size: int, zero_point: bool,
62
+ lm_head_quantized: bool,
63
+ modules_to_not_convert: Optional[list[str]],
64
+ full_config: dict[str, Any]) -> None:
65
+ super().__init__()
66
+ self.pack_factor = 32 // weight_bits # packed into int32
67
+ self.group_size = group_size
68
+ self.zero_point = zero_point
69
+ self.lm_head_quantized = lm_head_quantized
70
+ self.weight_bits = weight_bits
71
+ self.modules_to_not_convert = modules_to_not_convert or []
72
+ self.full_config = full_config
73
+
74
+ if self.weight_bits not in self.TYPE_MAP:
75
+ raise ValueError(f"Unsupported num_bits = {self.weight_bits}. "
76
+ f"Supported num_bits = {self.TYPE_MAP.keys()}")
77
+
78
+ self.quant_type = self.TYPE_MAP[self.weight_bits]
79
+
80
+ verify_marlin_supported(self.quant_type,
81
+ group_size=self.group_size,
82
+ has_zp=self.zero_point)
83
+
84
+ def __repr__(self) -> str:
85
+ return (f"AWQMarlinConfig(quant_type={self.quant_type}, "
86
+ f"group_size={self.group_size}, "
87
+ f"zero_point={self.zero_point}, "
88
+ f"lm_head_quantized={self.lm_head_quantized}, "
89
+ f"modules_to_not_convert={self.modules_to_not_convert})")
90
+
91
+ @classmethod
92
+ def get_name(cls) -> QuantizationMethods:
93
+ return "awq_marlin"
94
+
95
+ @classmethod
96
+ def get_supported_act_dtypes(cls) -> list[torch.dtype]:
97
+ return [torch.half, torch.bfloat16]
98
+
99
+ @classmethod
100
+ def get_min_capability(cls) -> int:
101
+ return 80
102
+
103
+ @classmethod
104
+ def get_config_filenames(cls) -> list[str]:
105
+ return ["quantize_config.json"]
106
+
107
+ @classmethod
108
+ def from_config(cls, config: dict[str, Any]) -> "AWQMarlinConfig":
109
+ weight_bits = cls.get_from_keys(config, ["bits"])
110
+ group_size = cls.get_from_keys(config, ["group_size"])
111
+ zero_point = cls.get_from_keys(config, ["zero_point"])
112
+ lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
113
+ default=False)
114
+ modules_to_not_convert = cls.get_from_keys_or(
115
+ config, ["modules_to_not_convert"], None)
116
+ return cls(weight_bits, group_size, zero_point, lm_head_quantized,
117
+ modules_to_not_convert, config)
118
+
119
+ @classmethod
120
+ def override_quantization_method(
121
+ cls, hf_quant_cfg, user_quant) -> Optional[QuantizationMethods]:
122
+ can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
123
+ is_valid_user_quant = (user_quant is None or user_quant == "marlin"
124
+ or user_quant == "awq_marlin")
125
+
126
+ if can_convert and is_valid_user_quant:
127
+ msg = ("The model is convertible to {} during runtime."
128
+ " Using {} kernel.".format(cls.get_name(), cls.get_name()))
129
+ logger.info(msg)
130
+ return cls.get_name()
131
+
132
+ if can_convert and user_quant == "awq":
133
+ logger.info("Detected that the model can run with awq_marlin"
134
+ ", however you specified quantization=awq explicitly,"
135
+ " so forcing awq. Use quantization=awq_marlin for"
136
+ " faster inference")
137
+ return None
138
+
139
+ def get_quant_method(self, layer: torch.nn.Module,
140
+ prefix: str) -> Optional["QuantizeMethodBase"]:
141
+ if (isinstance(layer, LinearBase) or
142
+ (isinstance(layer, ParallelLMHead) and self.lm_head_quantized)):
143
+ if is_layer_skipped_awq(prefix, self.modules_to_not_convert):
144
+ return UnquantizedLinearMethod()
145
+ # Check if the layer is supported by AWQMarlin.
146
+ if not check_marlin_supports_layer(layer, self.group_size):
147
+ logger.warning_once(
148
+ "Layer '%s' is not supported by AWQMarlin. Falling back to unoptimized AWQ kernels.", # noqa: E501
149
+ prefix,
150
+ )
151
+ return AWQConfig.from_config(
152
+ self.full_config).get_quant_method(layer, prefix)
153
+ return AWQMarlinLinearMethod(self)
154
+ elif isinstance(layer, FusedMoE):
155
+ from vllm.model_executor.layers.quantization.moe_wna16 import (
156
+ MoeWNA16Config)
157
+ if not check_moe_marlin_supports_layer(layer, self.group_size):
158
+ logger.warning_once(
159
+ f"Layer '{prefix}' is not supported by AWQMoeMarlin. "
160
+ "Falling back to Moe WNA16 kernels.")
161
+ return MoeWNA16Config.from_config(
162
+ self.full_config).get_quant_method(layer, prefix)
163
+ return get_moe_quant_method(self, layer, prefix,
164
+ AWQMoEMethod)
165
+ return None
166
+
167
+ @classmethod
168
+ def is_awq_marlin_compatible(cls, quant_config: dict[str, Any]):
169
+ # Extract data from quant config.
170
+ quant_method = quant_config.get("quant_method", "").lower()
171
+ num_bits = quant_config.get("bits")
172
+ group_size = quant_config.get("group_size")
173
+ zero_point = quant_config.get("zero_point")
174
+
175
+ if not current_platform.is_cuda():
176
+ return False
177
+
178
+ if quant_method != "awq":
179
+ return False
180
+
181
+ # If we cannot find the info needed in the config, cannot convert.
182
+ if (num_bits is None or group_size is None or zero_point is None):
183
+ return False
184
+
185
+ if num_bits not in cls.TYPE_MAP:
186
+ return False
187
+
188
+ return check_marlin_supported(quant_type=cls.TYPE_MAP[num_bits],
189
+ group_size=group_size,
190
+ has_zp=zero_point)
191
+
192
+
193
+ class AWQMarlinLinearMethod(LinearMethodBase):
194
+ """Linear method for AWQ Marlin.
195
+
196
+ Args:
197
+ quant_config: The AWQ Marlin quantization config.
198
+ """
199
+
200
+ def __init__(self, quant_config: AWQMarlinConfig) -> None:
201
+ self.quant_config = quant_config
202
+
203
+ def create_weights(
204
+ self,
205
+ layer: torch.nn.Module,
206
+ input_size_per_partition: int,
207
+ output_partition_sizes: list[int],
208
+ input_size: int,
209
+ output_size: int,
210
+ params_dtype: torch.dtype,
211
+ **extra_weight_attrs,
212
+ ) -> None:
213
+ del output_size
214
+ output_size_per_partition = sum(output_partition_sizes)
215
+ weight_loader = extra_weight_attrs.get("weight_loader")
216
+
217
+ # Normalize group_size
218
+ if self.quant_config.group_size != -1:
219
+ group_size = self.quant_config.group_size
220
+ else:
221
+ group_size = input_size
222
+
223
+ verify_marlin_supports_shape(
224
+ output_size_per_partition=output_size_per_partition,
225
+ input_size_per_partition=input_size_per_partition,
226
+ input_size=input_size,
227
+ group_size=group_size)
228
+
229
+ qweight = PackedvLLMParameter(
230
+ data=torch.empty(
231
+ input_size_per_partition,
232
+ output_size_per_partition // self.quant_config.pack_factor,
233
+ dtype=torch.int32,
234
+ ),
235
+ input_dim=0,
236
+ output_dim=1,
237
+ packed_dim=1,
238
+ packed_factor=self.quant_config.pack_factor,
239
+ weight_loader=weight_loader)
240
+
241
+ num_groups = input_size_per_partition // group_size
242
+
243
+ qzeros = PackedvLLMParameter(
244
+ data=torch.empty(
245
+ num_groups,
246
+ output_size_per_partition // self.quant_config.pack_factor,
247
+ dtype=torch.int32,
248
+ ),
249
+ input_dim=0,
250
+ output_dim=1,
251
+ packed_dim=1,
252
+ packed_factor=self.quant_config.pack_factor,
253
+ weight_loader=weight_loader)
254
+
255
+ scales = GroupQuantScaleParameter(data=torch.empty(
256
+ num_groups,
257
+ output_size_per_partition,
258
+ dtype=params_dtype,
259
+ ),
260
+ input_dim=0,
261
+ output_dim=1,
262
+ weight_loader=weight_loader)
263
+
264
+ layer.register_parameter("qweight", qweight)
265
+ layer.register_parameter("qzeros", qzeros)
266
+ layer.register_parameter("scales", scales)
267
+
268
+ layer.input_size_per_partition = input_size_per_partition
269
+ layer.output_size_per_partition = output_size_per_partition
270
+ layer.num_groups = num_groups
271
+
272
+ # TODO: Update this docs
273
+ # Checkpoints are serialized in AutoAWQ format, which is different from the
274
+ # marlin format. This function is called after the weights are loaded.
275
+ # Here, we handle the repacking
276
+ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
277
+ device = layer.qweight.device
278
+ layer.qweight = torch.nn.Parameter(layer.qweight.data,
279
+ requires_grad=False)
280
+ layer.qzeros = torch.nn.Parameter(layer.qzeros.data,
281
+ requires_grad=False)
282
+ layer.scales = torch.nn.Parameter(layer.scales.data,
283
+ requires_grad=False)
284
+
285
+ # Allocate marlin workspace
286
+ layer.workspace = marlin_make_workspace_new(device)
287
+
288
+ # Repack weights from AWQ format to marlin format.
289
+ marlin_qweight = ops.awq_marlin_repack(
290
+ layer.qweight,
291
+ size_k=layer.input_size_per_partition,
292
+ size_n=layer.output_size_per_partition,
293
+ num_bits=self.quant_config.quant_type.size_bits)
294
+ replace_parameter(layer, "qweight", marlin_qweight)
295
+
296
+ # Permute scales from AWQ format to marlin format.
297
+ marlin_scales = marlin_permute_scales(
298
+ layer.scales,
299
+ size_k=layer.input_size_per_partition,
300
+ size_n=layer.output_size_per_partition,
301
+ group_size=self.quant_config.group_size)
302
+ replace_parameter(layer, "scales", marlin_scales)
303
+
304
+ # Permute zero-points from AWQ format to marlin format.
305
+ marlin_zp = awq_to_marlin_zero_points(
306
+ layer.qzeros,
307
+ size_k=layer.num_groups,
308
+ size_n=layer.output_size_per_partition,
309
+ num_bits=self.quant_config.quant_type.size_bits)
310
+ replace_parameter(layer, "qzeros", marlin_zp)
311
+
312
+ # Not-used
313
+ layer.g_idx = marlin_make_empty_g_idx(device)
314
+ layer.g_idx_sort_indices = marlin_make_empty_g_idx(device)
315
+
316
+ def apply(
317
+ self,
318
+ layer: torch.nn.Module,
319
+ x: torch.Tensor,
320
+ bias: Optional[torch.Tensor] = None,
321
+ ) -> torch.Tensor:
322
+ return apply_awq_marlin_linear(
323
+ input=x,
324
+ weight=layer.qweight,
325
+ weight_scale=layer.scales,
326
+ weight_zp=layer.qzeros,
327
+ g_idx=layer.g_idx,
328
+ g_idx_sort_indices=layer.g_idx_sort_indices,
329
+ workspace=layer.workspace,
330
+ quant_type=self.quant_config.quant_type,
331
+ output_size_per_partition=layer.output_size_per_partition,
332
+ input_size_per_partition=layer.input_size_per_partition,
333
+ bias=bias)
334
+
335
+
336
+ class AWQMoEMethod(FusedMoEMethodBase):
337
+
338
+ def __init__(self, quant_config: AWQMarlinConfig):
339
+ self.quant_config = quant_config
340
+ if self.quant_config.weight_bits != 4:
341
+ raise ValueError("AWQMoEMethod only supports 4bit now.")
342
+ self.quant_type = scalar_types.uint4
343
+
344
+ def create_weights(self, layer: torch.nn.Module, num_experts: int,
345
+ hidden_size: int, intermediate_size_per_partition: int,
346
+ params_dtype: torch.dtype, **extra_weight_attrs):
347
+ extra_weight_attrs.update({
348
+ "is_transposed":
349
+ True,
350
+ "quant_method":
351
+ FusedMoeWeightScaleSupported.GROUP.value,
352
+ })
353
+
354
+ w13_qweight = Parameter(
355
+ torch.empty(num_experts,
356
+ hidden_size,
357
+ 2 * intermediate_size_per_partition //
358
+ self.quant_config.pack_factor,
359
+ dtype=torch.int32),
360
+ requires_grad=False)
361
+ layer.register_parameter("w13_qweight", w13_qweight)
362
+ set_weight_attrs(w13_qweight, extra_weight_attrs)
363
+
364
+ w2_qweight = Parameter(torch.empty(num_experts,
365
+ intermediate_size_per_partition,
366
+ hidden_size //
367
+ self.quant_config.pack_factor,
368
+ dtype=torch.int32),
369
+ requires_grad=False)
370
+ layer.register_parameter("w2_qweight", w2_qweight)
371
+ set_weight_attrs(w2_qweight, extra_weight_attrs)
372
+
373
+ num_groups_w13 = hidden_size // self.quant_config.group_size
374
+ num_groups_w2 = (intermediate_size_per_partition //
375
+ self.quant_config.group_size)
376
+
377
+ # WEIGHT_SCALES
378
+ # Allocate 2 scales for w1 and w3 respectively.
379
+ w13_scales = Parameter(torch.empty(num_experts,
380
+ num_groups_w13,
381
+ intermediate_size_per_partition * 2,
382
+ dtype=params_dtype),
383
+ requires_grad=False)
384
+ layer.register_parameter("w13_scales", w13_scales)
385
+ set_weight_attrs(w13_scales, extra_weight_attrs)
386
+
387
+ w2_scales = Parameter(torch.empty(num_experts,
388
+ num_groups_w2,
389
+ hidden_size,
390
+ dtype=params_dtype),
391
+ requires_grad=False)
392
+ layer.register_parameter("w2_scales", w2_scales)
393
+ set_weight_attrs(w2_scales, extra_weight_attrs)
394
+
395
+ # WEIGHT_ZERO_POINT
396
+ # Allocate 2 zero points for w1 and w3 respectively.
397
+ w13_qzeros = Parameter(
398
+ torch.empty(num_experts,
399
+ num_groups_w13,
400
+ 2 * intermediate_size_per_partition //
401
+ self.quant_config.pack_factor,
402
+ dtype=torch.int32),
403
+ requires_grad=False)
404
+ layer.register_parameter("w13_qzeros", w13_qzeros)
405
+ set_weight_attrs(w13_qzeros, extra_weight_attrs)
406
+
407
+ w2_qzeros = Parameter(torch.empty(num_experts,
408
+ num_groups_w2,
409
+ hidden_size //
410
+ self.quant_config.pack_factor,
411
+ dtype=torch.int32),
412
+ requires_grad=False)
413
+ layer.register_parameter("w2_qzeros", w2_qzeros)
414
+ set_weight_attrs(w2_qzeros, extra_weight_attrs)
415
+
416
+ device = layer.w13_qweight.device
417
+ layer.workspace = marlin_make_workspace_new(device, 4)
418
+
419
+ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
420
+ num_experts = layer.w13_qweight.shape[0]
421
+ device = layer.w13_qweight.device
422
+
423
+ layer.w13_g_idx_sort_indices = torch.nn.Parameter(
424
+ torch.empty((num_experts, 0), dtype=torch.int32, device=device),
425
+ requires_grad=False,
426
+ )
427
+ layer.w2_g_idx_sort_indices = torch.nn.Parameter(
428
+ torch.empty((num_experts, 0), dtype=torch.int32, device=device),
429
+ requires_grad=False,
430
+ )
431
+
432
+ marlin_w13_qweight = ops.awq_marlin_moe_repack(
433
+ layer.w13_qweight,
434
+ layer.w13_g_idx_sort_indices,
435
+ size_k=layer.w13_qweight.shape[1],
436
+ size_n=layer.w13_qweight.shape[2] * self.quant_config.pack_factor,
437
+ num_bits=self.quant_config.weight_bits,
438
+ )
439
+ replace_parameter(layer, "w13_qweight", marlin_w13_qweight)
440
+
441
+ marlin_w2_qweight = ops.awq_marlin_moe_repack(
442
+ layer.w2_qweight,
443
+ layer.w2_g_idx_sort_indices,
444
+ size_k=layer.w2_qweight.shape[1],
445
+ size_n=layer.w2_qweight.shape[2] * self.quant_config.pack_factor,
446
+ num_bits=self.quant_config.weight_bits,
447
+ )
448
+ replace_parameter(layer, "w2_qweight", marlin_w2_qweight)
449
+
450
+ # Why does this take the intermediate size for size_k?
451
+ marlin_w13_scales = marlin_moe_permute_scales(
452
+ s=layer.w13_scales,
453
+ size_k=layer.intermediate_size_per_partition,
454
+ size_n=layer.w13_scales.shape[2],
455
+ group_size=self.quant_config.group_size,
456
+ )
457
+
458
+ replace_parameter(layer, "w13_scales", marlin_w13_scales)
459
+
460
+ marlin_w2_scales = marlin_moe_permute_scales(
461
+ s=layer.w2_scales,
462
+ size_k=layer.intermediate_size_per_partition,
463
+ size_n=layer.w2_scales.shape[2],
464
+ group_size=self.quant_config.group_size,
465
+ )
466
+ replace_parameter(layer, "w2_scales", marlin_w2_scales)
467
+
468
+ marlin_w13_zp = moe_awq_to_marlin_zero_points(
469
+ layer.w13_qzeros,
470
+ size_k=layer.w13_qzeros.shape[1],
471
+ size_n=layer.w13_qzeros.shape[2] * self.quant_config.pack_factor,
472
+ num_bits=self.quant_config.weight_bits)
473
+ replace_parameter(layer, "w13_qzeros", marlin_w13_zp)
474
+
475
+ marlin_w2_zp = moe_awq_to_marlin_zero_points(
476
+ layer.w2_qzeros,
477
+ size_k=layer.w2_qzeros.shape[1],
478
+ size_n=layer.w2_qzeros.shape[2] * self.quant_config.pack_factor,
479
+ num_bits=self.quant_config.weight_bits)
480
+ replace_parameter(layer, "w2_qzeros", marlin_w2_zp)
481
+
482
+ def apply(
483
+ self,
484
+ layer: torch.nn.Module,
485
+ x: torch.Tensor,
486
+ router_logits: torch.Tensor,
487
+ top_k: int,
488
+ renormalize: bool,
489
+ use_grouped_topk: bool = False,
490
+ topk_group: Optional[int] = None,
491
+ num_expert_group: Optional[int] = None,
492
+ global_num_experts: int = -1,
493
+ expert_map: Optional[torch.Tensor] = None,
494
+ custom_routing_function: Optional[Callable] = None,
495
+ scoring_func: str = "softmax",
496
+ e_score_correction_bias: Optional[torch.Tensor] = None,
497
+ apply_router_weight_on_input: bool = False,
498
+ activation: str = "silu",
499
+ enable_eplb: bool = False,
500
+ expert_load_view: Optional[torch.Tensor] = None,
501
+ logical_to_physical_map: Optional[torch.Tensor] = None,
502
+ logical_replica_count: Optional[torch.Tensor] = None,
503
+ ) -> torch.Tensor:
504
+ if enable_eplb:
505
+ raise NotImplementedError(
506
+ "EPLB not supported for `AWQMoEMethod` yet.")
507
+
508
+ assert activation == "silu", "Only SiLU activation is supported."
509
+
510
+ topk_weights, topk_ids = FusedMoE.select_experts(
511
+ hidden_states=x,
512
+ router_logits=router_logits,
513
+ use_grouped_topk=use_grouped_topk,
514
+ top_k=top_k,
515
+ renormalize=renormalize,
516
+ topk_group=topk_group,
517
+ num_expert_group=num_expert_group,
518
+ custom_routing_function=custom_routing_function,
519
+ scoring_func=scoring_func,
520
+ e_score_correction_bias=e_score_correction_bias)
521
+
522
+ return torch.ops.vllm.fused_marlin_moe(
523
+ x,
524
+ layer.w13_qweight,
525
+ layer.w2_qweight,
526
+ layer.w13_scales,
527
+ layer.w2_scales,
528
+ router_logits,
529
+ topk_weights,
530
+ topk_ids,
531
+ quant_type_id=self.quant_type.id,
532
+ apply_router_weight_on_input=apply_router_weight_on_input,
533
+ global_num_experts=global_num_experts,
534
+ expert_map=expert_map,
535
+ w1_zeros=layer.w13_qzeros,
536
+ w2_zeros=layer.w2_qzeros,
537
+ workspace=layer.workspace)
chat_template.jinja ADDED
@@ -0,0 +1,103 @@
1
+ [gMASK]<sop>
2
+ {%- if tools -%}
3
+ <|system|>
4
+ # Tools
5
+
6
+ You may call one or more functions to assist with the user query.
7
+
8
+ You are provided with function signatures within <tools></tools> XML tags:
9
+ <tools>
10
+ {% for tool in tools %}
11
+ {{ tool | tojson(ensure_ascii=False) }}
12
+ {% endfor %}
13
+ </tools>
14
+
15
+ For each function call, output the function name and arguments within the following XML format:
16
+ <tool_call>{function-name}
17
+ <arg_key>{arg-key-1}</arg_key>
18
+ <arg_value>{arg-value-1}</arg_value>
19
+ <arg_key>{arg-key-2}</arg_key>
20
+ <arg_value>{arg-value-2}</arg_value>
21
+ ...
22
+ </tool_call>{%- endif -%}
23
+ {%- macro visible_text(content) -%}
24
+ {%- if content is string -%}
25
+ {{- content }}
26
+ {%- elif content is iterable and content is not mapping -%}
27
+ {%- for item in content -%}
28
+ {%- if item is mapping and item.type == 'text' -%}
29
+ {{- item.text }}
30
+ {%- elif item is string -%}
31
+ {{- item }}
32
+ {%- endif -%}
33
+ {%- endfor -%}
34
+ {%- else -%}
35
+ {{- content }}
36
+ {%- endif -%}
37
+ {%- endmacro -%}
38
+ {%- set ns = namespace(last_user_index=-1) %}
39
+ {%- for m in messages %}
40
+ {%- if m.role == 'user' %}
41
+ {% set ns.last_user_index = loop.index0 -%}
42
+ {%- endif %}
43
+ {%- endfor %}
44
+ {% for m in messages %}
45
+ {%- if m.role == 'user' -%}<|user|>
46
+ {{ visible_text(m.content) }}
47
+ {{- '/nothink' if (enable_thinking is defined and not enable_thinking and not visible_text(m.content).endswith("/nothink")) else '' -}}
48
+ {%- elif m.role == 'assistant' -%}
49
+ <|assistant|>
50
+ {%- set reasoning_content = '' %}
51
+ {%- set content = visible_text(m.content) %}
52
+ {%- if m.reasoning_content is string %}
53
+ {%- set reasoning_content = m.reasoning_content %}
54
+ {%- else %}
55
+ {%- if '</think>' in content %}
56
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
57
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
58
+ {%- endif %}
59
+ {%- endif %}
60
+ {%- if loop.index0 > ns.last_user_index and reasoning_content -%}
61
+ {{ '\n<think>' + reasoning_content.strip() + '</think>'}}
62
+ {%- else -%}
63
+ {{ '\n<think></think>' }}
64
+ {%- endif -%}
65
+ {%- if content.strip() -%}
66
+ {{ '\n' + content.strip() }}
67
+ {%- endif -%}
68
+ {% if m.tool_calls %}
69
+ {% for tc in m.tool_calls %}
70
+ {%- if tc.function %}
71
+ {%- set tc = tc.function %}
72
+ {%- endif %}
73
+ {{ '\n<tool_call>' + tc.name }}
74
+ {% set _args = tc.arguments %}
75
+ {% for k, v in _args.items() %}
76
+ <arg_key>{{ k }}</arg_key>
77
+ <arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>
78
+ {% endfor %}
79
+ </tool_call>{% endfor %}
80
+ {% endif %}
81
+ {%- elif m.role == 'tool' -%}
82
+ {%- if m.content is string -%}
83
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
84
+ {{- '<|observation|>' }}
85
+ {%- endif %}
86
+ {{- '\n<tool_response>\n' }}
87
+ {{- m.content }}
88
+ {{- '\n</tool_response>' }}
89
+ {%- else -%}
90
+ <|observation|>{% for tr in m.content %}
91
+
92
+ <tool_response>
93
+ {{ tr.output if tr.output is defined else tr }}
94
+ </tool_response>{% endfor -%}
95
+ {% endif -%}
96
+ {%- elif m.role == 'system' -%}
97
+ <|system|>
98
+ {{ visible_text(m.content) }}
99
+ {%- endif -%}
100
+ {%- endfor -%}
101
+ {%- if add_generation_prompt -%}
102
+ <|assistant|>{{- '\n<think></think>' if (enable_thinking is defined and not enable_thinking) else '' -}}
103
+ {%- endif -%}
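A hedged usage sketch for the template above (assumptions: the tokenizer files in this repo load via `AutoTokenizer`, and extra keyword arguments such as `enable_thinking` are forwarded to the chat template, as recent `transformers` releases do):

```python
# Render the chat template without tokenizing, to inspect the prompt format.
# enable_thinking=False appends /nothink to the user turn and an empty
# <think></think> block to the generation prompt, per the template logic above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tclf90/GLM-4.5-Air-AWQ-FP16Mix", trust_remote_code=True)
messages = [{"role": "user", "content": "Write a haiku about mixture-of-experts models."}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```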
config.json ADDED
@@ -0,0 +1,52 @@
1
+ {
2
+ "name_or_path": "tclf90/GLM-4.5-Air-AWQ-FP16Mix",
3
+ "architectures": [
4
+ "Glm4MoeForCausalLM"
5
+ ],
6
+ "attention_bias": true,
7
+ "attention_dropout": 0.0,
8
+ "pad_token_id": 151329,
9
+ "eos_token_id": [
10
+ 151329,
11
+ 151336,
12
+ 151338
13
+ ],
14
+ "head_dim": 128,
15
+ "hidden_act": "silu",
16
+ "hidden_size": 4096,
17
+ "partial_rotary_factor": 0.5,
18
+ "initializer_range": 0.02,
19
+ "intermediate_size": 11264,
20
+ "max_position_embeddings": 131072,
21
+ "model_type": "glm4_moe",
22
+ "moe_intermediate_size": 1408,
23
+ "norm_topk_prob": true,
24
+ "num_attention_heads": 96,
25
+ "n_group": 1,
26
+ "topk_group": 1,
27
+ "n_routed_experts": 128,
28
+ "n_shared_experts": 1,
29
+ "routed_scaling_factor": 1.0,
30
+ "num_experts_per_tok": 8,
31
+ "first_k_dense_replace": 1,
32
+ "num_hidden_layers": 46,
33
+ "num_key_value_heads": 8,
34
+ "rms_norm_eps": 1e-05,
35
+ "rope_scaling": null,
36
+ "rope_theta": 1000000,
37
+ "num_nextn_predict_layers": 1,
38
+ "tie_word_embeddings": false,
39
+ "torch_dtype": "float16",
40
+ "transformers_version": "4.54.0",
41
+ "use_cache": true,
42
+ "use_qk_norm": false,
43
+ "vocab_size": 151552,
44
+ "quantization_config": {
45
+ "quant_method": "awq_marlin",
46
+ "bits": 4,
47
+ "group_size": 128,
48
+ "version": "gemm",
49
+ "zero_point": true,
50
+ "modules_to_not_convert": ["model.embed_tokens", "model.layers.0.", "model.layers.1.", "model.layers.45.", "model.layers.46.", "model.layers.1.mlp.shared_experts.", "model.layers.2.mlp.shared_experts.", "model.layers.3.mlp.shared_experts.", "model.layers.4.mlp.shared_experts.", "model.layers.5.mlp.shared_experts.", "model.layers.6.mlp.shared_experts.", "model.layers.7.mlp.shared_experts.", "model.layers.8.mlp.shared_experts.", "model.layers.9.mlp.shared_experts.", "model.layers.10.mlp.shared_experts.", "model.layers.11.mlp.shared_experts.", "model.layers.12.mlp.shared_experts.", "model.layers.13.mlp.shared_experts.", "model.layers.14.mlp.shared_experts.", "model.layers.15.mlp.shared_experts.", "model.layers.16.mlp.shared_experts.", "model.layers.17.mlp.shared_experts.", "model.layers.18.mlp.shared_experts.", "model.layers.19.mlp.shared_experts.", "model.layers.20.mlp.shared_experts.", "model.layers.21.mlp.shared_experts.", "model.layers.22.mlp.shared_experts.", "model.layers.23.mlp.shared_experts.", "model.layers.24.mlp.shared_experts.", "model.layers.25.mlp.shared_experts.", "model.layers.26.mlp.shared_experts.", "model.layers.27.mlp.shared_experts.", "model.layers.28.mlp.shared_experts.", "model.layers.29.mlp.shared_experts.", "model.layers.30.mlp.shared_experts.", "model.layers.31.mlp.shared_experts.", "model.layers.32.mlp.shared_experts.", "model.layers.33.mlp.shared_experts.", "model.layers.34.mlp.shared_experts.", "model.layers.35.mlp.shared_experts.", "model.layers.36.mlp.shared_experts.", "model.layers.37.mlp.shared_experts.", "model.layers.38.mlp.shared_experts.", "model.layers.39.mlp.shared_experts.", "model.layers.40.mlp.shared_experts.", "model.layers.41.mlp.shared_experts.", "model.layers.42.mlp.shared_experts.", "model.layers.43.mlp.shared_experts.", "model.layers.44.mlp.shared_experts.", "model.layers.45.mlp.shared_experts.", "model.layers.46.mlp.shared_experts.", "model.layers.46.shared_head.", "lm_head"]
51
+ }
52
+ }
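The long `modules_to_not_convert` list above is what the patched `awq_marlin.py` consumes: any layer whose name contains one of these prefixes stays in FP16 instead of being AWQ-quantized. A small sketch of the matching rule (assumption: it mirrors vLLM's substring-based `is_layer_skipped_awq` check; `is_skipped` is a hypothetical helper for illustration):

```python
# Illustrative: reproduce the substring match that decides whether a layer is
# kept in FP16 (skipped) or loaded through the AWQ Marlin kernels.
import json

def is_skipped(layer_prefix: str, modules_to_not_convert: list[str]) -> bool:
    return any(m in layer_prefix for m in modules_to_not_convert)

cfg = json.load(open("config.json"))
skip = cfg["quantization_config"]["modules_to_not_convert"]
print(is_skipped("model.layers.3.mlp.shared_experts.gate_proj", skip))  # True  -> FP16
print(is_skipped("model.layers.3.mlp.experts.0.gate_proj", skip))       # False -> AWQ
```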
configuration.json ADDED
@@ -0,0 +1 @@
1
+ {"framework":"Pytorch","task":"text-generation"}
model-00001-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1fc26a957fd5241a3646ad8d4af0ecfb6c4afb3d5a331e2a5fccba165292049a
3
+ size 4996541272
model-00002-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8636bec0729cc1d553c1033cfe05bdbf9b31321c24e834f9deb7bd44caef4f96
3
+ size 4998307520
model-00003-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8a8c5c1b6b142cff5cf2c854ddba8dd5caeaea231a047aaee3bd073629a4713e
3
+ size 4999151480
model-00004-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8e2b4c25d8591c75d2d5066bf68b3ef8c032d96f5c228a50c745d3c8dbe2d3ba
3
+ size 4999153696
model-00005-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c224ab8dffd72a759ac18067b76fb26081356b2011c533ebe0125a87be261483
3
+ size 4990763840
model-00006-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:14d52bddbf5812b8262e12e40eb921ccc53b36e40a8ad591b6f12046e1a41f63
3
+ size 4997445712
model-00007-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a2a1b1955fc16166ad1bbd33e027b91f0cbbf1667bba4180e230283f5143aa32
3
+ size 4998360800
model-00008-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8f152246c1b439ecebf28559a2ac5508ebe6834a2e56549692e0ac21bb8939de
3
+ size 5000525936
model-00009-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e37674f17722dee14b814ae6d5b3672c6e3f25c6dedede1a5707fe63fd122824
3
+ size 4999156080
model-00010-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c8f5364deb67daceb52a427deccfb399d0fe8080c84a115ff1b9f02688010f48
3
+ size 4999156120
model-00011-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32b4de7acea1825f8963e287a589b5814e8510283aa38bc49865a76636d64d08
3
+ size 4999156120
model-00012-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8458af9bd76d476cc7c53b456064677ad9697590d447108339ef84cb80eda365
3
+ size 4999156120
model-00013-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9ac2c934733de485a23aea34609888eeb13d3ef4fe620254425114c5b40d337c
3
+ size 4995004840
model-00014-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:be46467218653568d2ebbb08a5ffb78c4c94b841426f151df6ee8675f0395392
3
+ size 4999719624
model-00015-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:77c8f54a9f3b2d23793cb870fb465bcee910af595db1d9d71adcd854ab793154
3
+ size 3056676280
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9340665016419c825c4bdabbcc9acc43b7ca2c68ce142724afa829abb1be5efd
3
+ size 19970699
tokenizer_config.json ADDED
@@ -0,0 +1,325 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "151329": {
4
+ "content": "<|endoftext|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "151330": {
12
+ "content": "[MASK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "151331": {
20
+ "content": "[gMASK]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "151332": {
28
+ "content": "[sMASK]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "151333": {
36
+ "content": "<sop>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "151334": {
44
+ "content": "<eop>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "151335": {
52
+ "content": "<|system|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "151336": {
60
+ "content": "<|user|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "151337": {
68
+ "content": "<|assistant|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "151338": {
76
+ "content": "<|observation|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "151339": {
84
+ "content": "<|begin_of_image|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "151340": {
92
+ "content": "<|end_of_image|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "151341": {
100
+ "content": "<|begin_of_video|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "151342": {
108
+ "content": "<|end_of_video|>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "151343": {
116
+ "content": "<|begin_of_audio|>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "151344": {
124
+ "content": "<|end_of_audio|>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "151345": {
132
+ "content": "<|begin_of_transcription|>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "151346": {
140
+ "content": "<|end_of_transcription|>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ },
147
+ "151347": {
148
+ "content": "<|code_prefix|>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": true
154
+ },
155
+ "151348": {
156
+ "content": "<|code_middle|>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "151349": {
164
+ "content": "<|code_suffix|>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "151350": {
172
+ "content": "<think>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "151351": {
180
+ "content": "</think>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "151352": {
188
+ "content": "<tool_call>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "151353": {
196
+ "content": "</tool_call>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "151354": {
204
+ "content": "<tool_response>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "151355": {
212
+ "content": "</tool_response>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "151356": {
220
+ "content": "<arg_key>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": false
226
+ },
227
+ "151357": {
228
+ "content": "</arg_key>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": false
234
+ },
235
+ "151358": {
236
+ "content": "<arg_value>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": false
242
+ },
243
+ "151359": {
244
+ "content": "</arg_value>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": false
250
+ },
251
+ "151360": {
252
+ "content": "/nothink",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "151361": {
260
+ "content": "<|begin_of_box|>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": false
266
+ },
267
+ "151362": {
268
+ "content": "<|end_of_box|>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "151363": {
276
+ "content": "<|image|>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "151364": {
284
+ "content": "<|video|>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ }
291
+ },
292
+ "additional_special_tokens": [
293
+ "<|endoftext|>",
294
+ "[MASK]",
295
+ "[gMASK]",
296
+ "[sMASK]",
297
+ "<sop>",
298
+ "<eop>",
299
+ "<|system|>",
300
+ "<|user|>",
301
+ "<|assistant|>",
302
+ "<|observation|>",
303
+ "<|begin_of_image|>",
304
+ "<|end_of_image|>",
305
+ "<|begin_of_video|>",
306
+ "<|end_of_video|>",
307
+ "<|begin_of_audio|>",
308
+ "<|end_of_audio|>",
309
+ "<|begin_of_transcription|>",
310
+ "<|end_of_transcription|>",
311
+ "<|code_prefix|>",
312
+ "<|code_middle|>",
313
+ "<|code_suffix|>",
314
+ "/nothink"
315
+ ],
316
+ "clean_up_tokenization_spaces": false,
317
+ "do_lower_case": false,
318
+ "eos_token": "<|endoftext|>",
319
+ "extra_special_tokens": {},
320
+ "model_max_length": 128000,
321
+ "pad_token": "<|endoftext|>",
322
+ "padding_side": "left",
323
+ "remove_space": false,
324
+ "tokenizer_class": "PreTrainedTokenizer"
325
+ }