intervitens committed on
Commit 2e174ea · verified · 1 parent: fb43af3

Upload folder using huggingface_hub

Files changed (43)
  1. LICENSE +34 -0
  2. Open Source Software Notice +219 -0
  3. README.md +156 -0
  4. README_EN.md +157 -0
  5. checklist.chk +38 -0
  6. config.json +34 -0
  7. configuration_pangu_moe.py +76 -0
  8. generation_config.json +13 -0
  9. model-00001-of-00029.safetensors +3 -0
  10. model-00002-of-00029.safetensors +3 -0
  11. model-00003-of-00029.safetensors +3 -0
  12. model-00004-of-00029.safetensors +3 -0
  13. model-00005-of-00029.safetensors +3 -0
  14. model-00006-of-00029.safetensors +3 -0
  15. model-00007-of-00029.safetensors +3 -0
  16. model-00008-of-00029.safetensors +3 -0
  17. model-00009-of-00029.safetensors +3 -0
  18. model-00010-of-00029.safetensors +3 -0
  19. model-00011-of-00029.safetensors +3 -0
  20. model-00012-of-00029.safetensors +3 -0
  21. model-00013-of-00029.safetensors +3 -0
  22. model-00014-of-00029.safetensors +3 -0
  23. model-00015-of-00029.safetensors +3 -0
  24. model-00016-of-00029.safetensors +3 -0
  25. model-00017-of-00029.safetensors +3 -0
  26. model-00018-of-00029.safetensors +3 -0
  27. model-00019-of-00029.safetensors +3 -0
  28. model-00020-of-00029.safetensors +3 -0
  29. model-00021-of-00029.safetensors +3 -0
  30. model-00022-of-00029.safetensors +3 -0
  31. model-00023-of-00029.safetensors +3 -0
  32. model-00024-of-00029.safetensors +3 -0
  33. model-00025-of-00029.safetensors +3 -0
  34. model-00026-of-00029.safetensors +3 -0
  35. model-00027-of-00029.safetensors +3 -0
  36. model-00028-of-00029.safetensors +3 -0
  37. model-00029-of-00029.safetensors +3 -0
  38. model.safetensors.index.json +0 -0
  39. modeling_pangu_moe.py +1004 -0
  40. special_tokens_map.json +23 -0
  41. tokenization_pangu_moe.py +273 -0
  42. tokenizer.model +3 -0
  43. tokenizer_config.json +1 -0
LICENSE ADDED
@@ -0,0 +1,34 @@
+ Pangu Model License Agreement Version 1.0
+
+ This Pangu Model License Agreement Version 1.0 (the "Agreement") is a legal agreement between You and Huawei Technologies Co., Ltd. ("Huawei", "We" or "Us"), and it governs Your reproduction, use, modification, and distribution of Pangu as made available by Huawei under this Agreement.
+
+ By using, reproducing, modifying, distributing, performing or displaying any portion or element of Pangu, or otherwise accepting the terms of this Agreement, You agree to be bound by this Agreement.
+
+ 1. Definitions.
+ 1.1. "Pangu" or "Model" means the Pangu large language models and software, including trained model weights, parameters (including optimizer states), and accompanying source code and scripts released under this Agreement.
+ 1.2. "Derivative Model" means all (1) modifications to the Model, (2) works based on the Model, and (3) any other derivative works of the Model. For clarity, information or content resulting from operating or otherwise using the Model is not a Derivative Model.
+ 1.3. "You" or "Your" means an individual or Legal Entity exercising permissions granted by this Agreement and/or using the Model for any purpose.
+ 1.4. "Third Party" or "Third Parties" means individuals or legal entities that are not under common control with Us or You.
+
+ 2. License Grant. Subject to Your full compliance with the terms and conditions of this Agreement, We hereby grant to You a perpetual, worldwide, non-exclusive, non-transferable, no-charge, royalty-free license (except as stated in Section 3) to use, reproduce, modify, and distribute the Model.
+
+ 3. Conditions for License Grant. You represent and warrant that You will not access, download, install, run, deploy, integrate, modify, or otherwise use the Model, directly or indirectly, within the European Union.
+
+
+ 4. Redistribution.
+ 4.1. If You distribute the Model or a Derivative Model, You shall retain in Your distribution (1) a copy of this Agreement, and (2) all copyright notices and other notices of origin included in the Model that are applicable to Your distribution.
+ 4.2. Further, if You distribute or make available to Third Parties a product or service (including another AI model) based on the Model, You are required to (1) display the acknowledgement "Powered by Pangu" and (2) include the trademark notice "Pangu is a trademark of Huawei Technologies Co., Ltd." on related webpages, user manuals, product documentation, or other advertising materials mentioning features of the Model.
+ 4.3. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for a Derivative Model made by You as a whole, provided Your use, reproduction, and distribution of the Model otherwise complies with the terms and conditions of this Agreement.
+
+ 5. Ownership. We do not claim ownership of any information or content generated using the Model or a Derivative Model made by You. You are solely responsible for evaluating the accuracy and appropriateness of such information or content for Your use case.
+
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of Huawei, except as required for complying with Section 4.2.
+
+ 7. Indemnity. You will indemnify and hold harmless Huawei from and against any claim by any third party arising out of or related to Your use or distribution of the Model or a Derivative Model made by You (e.g. a violation of Section 3). For the avoidance of doubt, "third party" in this clause includes supervisory authorities.
+
+ 8. THE MODEL IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, NONINFRINGEMENT, ACCURACY, OR THE ABSENCE OF LATENT OR OTHER DEFECTS OR ERRORS, WHETHER OR NOT DISCOVERABLE, ALL TO THE GREATEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW.
+
+ 9. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO, ANY DIRECT OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OF OR INABILITY TO USE THE MODEL, IN WHOLE OR IN PART, NO MATTER HOW IT IS CAUSED OR THE LEGAL THEORY ON WHICH IT IS BASED, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+
+ END OF THE TERMS AND CONDITIONS
Open Source Software Notice ADDED
@@ -0,0 +1,219 @@
+ OPEN SOURCE SOFTWARE NOTICE
+
+ Please note we provide an open source software notice along with this product and/or this product firmware (in the following just "this product"). The open source software licenses are granted by the respective right holders, and the open source licenses prevail over all other license information with regard to the respective open source software contained in the product, including but not limited to the End User Software Licensing Agreement. This notice is provided on behalf of Huawei Technologies Co. Ltd. and any of its local subsidiaries which may have provided this product to you in your local country.
+
+ Warranty Disclaimer
+ THE OPEN SOURCE SOFTWARE IN THIS PRODUCT IS DISTRIBUTED IN THE HOPE THAT IT WILL BE USEFUL, BUT WITHOUT ANY WARRANTY, WITHOUT EVEN THE IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. SEE THE APPLICABLE LICENSES FOR MORE DETAILS.
+
+ Copyright Notice and License Texts
+
+ Software: transformers
+ Copyright notice:
+ Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Team. All rights reserved.
+
+ License Text:
+ ----------------------------------------
+
+ Copyright 2018- The Hugging Face team. All rights reserved.
+
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+
+    END OF TERMS AND CONDITIONS
+
+    APPENDIX: How to apply the Apache License to your work.
+
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!) The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+
+    Copyright [yyyy] [name of copyright owner]
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
README.md ADDED
@@ -0,0 +1,156 @@
+ Reuploaded from https://gitcode.com/ascend-tribe/pangu-pro-moe-model
+
+ # Pangu Pro MoE: An Ascend-Native Mixture of Grouped Experts Model
+
+ ## Model Introduction
+
+ ![arch.PNG](https://raw.gitcode.com/ascend-tribe/pangu-pro-moe/blobs/7c83eb5c52ab91ba4bf2f8235ac1d0b1f9b49a7d/arch.PNG)
+
+ We propose a novel Mixture of Grouped Experts (MoGE) architecture that partitions the experts into groups during expert selection and constrains each token to activate an equal number of experts within every group, which yields natural load balancing across devices. Based on the MoGE architecture, we built the Pangu Pro MoE model with 72B total parameters and 16B activated parameters:
+
+ * Vocabulary size: 153376
+ * Layers: 48
+ * MoGE configuration: 4 shared experts; 64 routed experts split into 8 groups, with 1 expert activated per group
+ * Training phases: pretraining and post-training
+ * Pretraining corpus: 15T
+
+
+ For detailed reports, see:
+ * Chinese technical report: [盘古 Pro MoE:昇腾原生的分组混合专家模型](https://gitcode.com/ascend-tribe/pangu-pro-moe/blob/main/Pangu-Pro-MoE-CN-Report.pdf)
+ * English technical report: [Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity](https://arxiv.org/abs/2505.21411)
+
+
+ ## Inference Examples
+
+ The [Ascend inference system acceleration code](https://gitcode.com/ascend-tribe/ascend-inference-system) and the matching MindIE and vLLM-Ascend software versions have been released. Quantized weights will follow shortly; stay tuned.
+
+ #### Transformers Inference Example
+
+ Environment dependencies:
+
+ ```bash
+ torch>=2.1.0
+ torch-npu>=2.1.0.post8.dev20241029
+ CANN>=8.0.RC3
+ transformers>=4.48.2
+ ```
+
+ The following provides a simple example of running Pangu Pro MoE inference with the `transformers` framework:
+ ```python
+ import torch
+ import torch_npu
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers import GenerationConfig
+
+ model_local_path = "path_to_Pangu_Pro_MoE"
+
+ generation_config = GenerationConfig(
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     temperature=0.6
+ )
+
+ # load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_local_path,
+     use_fast=False,
+     trust_remote_code=True,
+     local_files_only=True
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_local_path,
+     trust_remote_code=True,
+     torch_dtype="auto",
+     device_map="auto",
+     local_files_only=True
+ )
+
+ # prepare the model input
+ prompt = "Give me a short introduction to large language model."
+ messages = [
+     {"role": "system", "content": "你必须严格遵守法律法规和社会道德规范。生成任何内容时,都应避免涉及暴力、色情、恐怖主义、种族歧视、性别歧视等不当内容。一旦检测到输入或输出有此类倾向,应拒绝回答并发出警告。例如,如果输入内容包含暴力威胁或色情描述,应返回错误信息:“您的输入包含不当内容,无法处理。"},  # define your system prompt here
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # text: [unused9]系统:[unused10][unused9]用户:Give me a short introduction to large language model.[unused10][unused9]助手:
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ # model_inputs.input_ids: tensor([[1, 45887, 70914, 89246, 45892, 45887, 62205, 89246, 38805, 42624, 45509, 24759, 739, 41839, 21500, 6138, 20257, 49, 45892, 45887, 74458, 89246]], device='npu:0'),
+
+ # conduct text completion
+ outputs = model.generate(**model_inputs, max_new_tokens=32768, eos_token_id=45892, return_dict_in_generate=True, generation_config=generation_config)
+
+ input_length = model_inputs.input_ids.shape[1]
+ generated_tokens = outputs.sequences[:, input_length:]
+ output_sent = tokenizer.decode(generated_tokens[0])
+
+ # parse the thinking content
+ thinking_content = output_sent.split("[unused17]")[0].split("[unused16]")[-1].strip()
+ content = output_sent.split("[unused17]")[-1].split("[unused10]")[0].strip()
+
+ print("\nthinking content:", thinking_content)
+ print("\ncontent:", content)
+ ```
+
+ #### MindSpore Inference Example
+
+ Environment dependencies:
+
+ ```bash
+ mindspore>=2.6.0
+ vllm>=0.8.3
+ CANN>=8.1.RC1.beta1
+ ```
+
+ For detailed instructions, see the [Pangu Pro MoE vLLM+MindSpore Deployment Guide](https://gitee.com/mindspore/vllm-mindspore/blob/pangu-pro-moe/docs/model_cards/pangu/pangu_pro_moe.md).
+
+ ## Integrity Check
+
+ Use the following method to verify the integrity of the downloaded content; the hash values are stored in the `checklist.chk` file.
+
+ ```bash
+ #!/usr/bin/env bash
+ ARCH=$(uname -m)
+ MODEL_PATH="${TARGET_FOLDER}/${MODEL_FOLDER_PATH}"
+ cd "$MODEL_PATH" || exit 1
+ if [ "$ARCH" = "arm64" ]; then
+     # macOS md5 has no checklist mode; verify each entry explicitly
+     while read -r sum file; do
+         [ "$(md5 -q "$file")" = "$sum" ] && echo "$file: OK" || echo "$file: FAILED"
+     done < checklist.chk
+ else
+     md5sum -c checklist.chk
+ fi
+ ```
+
+ ## Model License
+
+ The Pangu Pro MoE model is licensed under the Pangu Model License Agreement, which is intended to permit use and promote the further development of artificial intelligence technology. See the `LICENSE` file in the root directory of the model repository for details.
+
+ ## Disclaimer
+
+ Due to limitations inherent in the technology on which Pangu Pro MoE (the "Model") relies, and because AI-generated content is produced automatically by Pangu, we cannot make any guarantees regarding the following:
+
+ 1. The Model's output is generated automatically by AI algorithms; the possibility that some information is flawed, unreasonable, or discomforting cannot be ruled out, and the generated content does not represent Huawei's attitude or position;
+ 2. There is no guarantee that the Model is 100% accurate, reliable, fully functional, timely, secure, error-free, uninterrupted, continuously stable, or free of any faults;
+ 3. The Model's output does not constitute any advice or decision, nor does it guarantee the authenticity, completeness, accuracy, timeliness, legality, functionality, or practicality of the generated content. The generated content cannot replace professionals in medicine, law, or other fields in answering your questions; it is for reference only and does not represent any attitude, position, or view of Huawei. You must make independent judgments based on your actual situation, and Huawei assumes no liability.
+
+ ## Citation
+
+ If you find our work helpful, we welcome citations.
+
+ ```bibtex
+ @article{tang2025pangu,
+   title={Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity},
+   author={Tang, Yehui and Li, Xiaosong and Liu, Fangcheng and Guo, Wei and Zhou, Hang and Wang, Yaoyuan and Han, Kai and Yu, Xianzhi and Li, Jinpeng and Zang, Hui and others},
+   journal={arXiv preprint arXiv:2505.21411},
+   year={2025}
+ }
+ ```
+
+ ## Feedback
+
+ If you have any comments or suggestions, please open an issue or contact [email protected]
README_EN.md ADDED
@@ -0,0 +1,157 @@
+ # Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
+
+ ### Model Introduction
+
+ ![arch.PNG](https://raw.gitcode.com/ascend-tribe/pangu-pro-moe/blobs/7c83eb5c52ab91ba4bf2f8235ac1d0b1f9b49a7d/arch.PNG)
+
+ We introduce a novel Mixture of Grouped Experts (MoGE) architecture that partitions experts into distinct groups during the selection phase. By enforcing an equal number of expert activations per group for each token, MoGE inherently achieves load balancing across devices. Leveraging this architecture, we have developed the Pangu Pro MoE model, with 72B total and 16B activated parameters and the following specifications (a toy routing sketch follows the list):
+
+ - Vocabulary Size: 153,376
+ - Layers: 48
+ - MoGE Configuration: 4 shared experts; 64 routed experts grouped into 8 clusters with 1 expert activated per group
+ - Training Phases: Pretraining and Post-training
+ - Pretraining Corpus: 15TB
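+
+ As a toy illustration of the grouped routing described above (a minimal sketch, not the repository's implementation; see `modeling_pangu_moe.py` for the actual model code), the router can score all 64 experts, reshape the scores into 8 groups of 8, and pick the top expert independently inside each group:
+
+ ```python
+ import torch
+
+ def grouped_topk(router_logits: torch.Tensor, num_groups: int = 8, k_per_group: int = 1):
+     """MoGE-style routing sketch: an equal number of activations per group.
+
+     router_logits: [num_tokens, num_experts], num_experts divisible by num_groups.
+     """
+     num_tokens, num_experts = router_logits.shape
+     group_size = num_experts // num_groups
+     scores = router_logits.softmax(dim=-1)
+     # top-k selection runs independently inside each group
+     grouped = scores.view(num_tokens, num_groups, group_size)
+     topk_scores, topk_idx = grouped.topk(k_per_group, dim=-1)
+     # map intra-group indices back to flat expert ids
+     offsets = torch.arange(num_groups, device=router_logits.device).view(1, num_groups, 1) * group_size
+     expert_idx = (topk_idx + offsets).reshape(num_tokens, -1)
+     weights = topk_scores.reshape(num_tokens, -1)
+     return weights / weights.sum(dim=-1, keepdim=True), expert_idx
+
+ # every token activates exactly 1 of the 8 experts in each of the 8 groups
+ weights, expert_idx = grouped_topk(torch.randn(4, 64))
+ assert expert_idx.shape == (4, 8)
+ ```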
+
+ For detailed technical documentation, please refer to:
+
+ - **Chinese Technical Report**: [盘古 Pro MoE:昇腾原生的分组混合专家模型](https://gitcode.com/ascend-tribe/pangu-pro-moe/blob/main/Pangu-Pro-MoE-CN-Report.pdf)
+ - **English Technical Report**: [Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity](https://arxiv.org/abs/2505.21411)
+
+
+ ## Inference Examples
+
+ The [Ascend inference acceleration code](https://gitcode.com/ascend-tribe/ascend-inference-system), along with the supporting MindIE and vLLM-Ascend software versions, has been officially released. Quantized weights will be rolled out in the near term; please stay tuned.
+
+ #### Transformers Inference
+
+ Environment Dependencies:
+
+ ```bash
+ torch>=2.1.0
+ torch-npu>=2.1.0.post8.dev20241029
+ CANN>=8.0.RC3
+ transformers>=4.48.2
+ ```
+
+ The following provides a simple inference example for Pangu Pro MoE based on the `transformers` framework:
+
+ ```python
+ import torch
+ import torch_npu
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers import GenerationConfig
+
+ model_local_path = "path_to_Pangu_Pro_MoE"
+
+ generation_config = GenerationConfig(
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     temperature=0.6
+ )
+
+ # load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_local_path,
+     use_fast=False,
+     trust_remote_code=True,
+     local_files_only=True
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_local_path,
+     trust_remote_code=True,
+     torch_dtype="auto",
+     device_map="auto",
+     local_files_only=True
+ )
+
+ # prepare the model input
+ prompt = "Give me a short introduction to large language model."
+ messages = [
+     {"role": "system", "content": "你必须严格遵守法律法规和社会道德规范。生成任何内容时,都应避免涉及暴力、色情、恐怖主义、种族歧视、性别歧视等不当内容。一旦检测到输入或输出有此类倾向,应拒绝回答并发出警告。例如,如果输入内容包含暴力威胁或色情描述,应返回错误信息:“您的输入包含不当内容,无法处理。"},  # define your system prompt here
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # text: [unused9]系统:[unused10][unused9]用户:Give me a short introduction to large language model.[unused10][unused9]助手:
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ # model_inputs.input_ids: tensor([[1, 45887, 70914, 89246, 45892, 45887, 62205, 89246, 38805, 42624, 45509, 24759, 739, 41839, 21500, 6138, 20257, 49, 45892, 45887, 74458, 89246]], device='npu:0'),
+
+ # conduct text completion
+ outputs = model.generate(**model_inputs, max_new_tokens=32768, eos_token_id=45892, return_dict_in_generate=True, generation_config=generation_config)
+
+ input_length = model_inputs.input_ids.shape[1]
+ generated_tokens = outputs.sequences[:, input_length:]
+ output_sent = tokenizer.decode(generated_tokens[0])
+
+ # parse the thinking content
+ thinking_content = output_sent.split("[unused17]")[0].split("[unused16]")[-1].strip()
+ content = output_sent.split("[unused17]")[-1].split("[unused10]")[0].strip()
+
+ print("\nthinking content:", thinking_content)
+ print("\ncontent:", content)
+ ```
+
+ #### MindSpore Inference
+
+ Environment Dependencies:
+
+ ```bash
+ mindspore>=2.6.0
+ vllm>=0.8.3
+ CANN>=8.1.RC1.beta1
+ ```
+
+ For detailed instructions, please refer to the [Pangu Pro MoE vLLM+MindSpore Deployment Instructions](https://gitee.com/mindspore/vllm-mindspore/blob/pangu-pro-moe/docs/model_cards/pangu/pangu_pro_moe.md).
+
+ ## Integrity Check
+
+ Please use the following method to verify the integrity of the downloaded content. The hash values are stored in the `checklist.chk` file.
+
+ ```bash
+ #!/usr/bin/env bash
+ ARCH=$(uname -m)
+ MODEL_PATH="${TARGET_FOLDER}/${MODEL_FOLDER_PATH}"
+ cd "$MODEL_PATH" || exit 1
+ if [ "$ARCH" = "arm64" ]; then
+     # macOS md5 has no checklist mode; verify each entry explicitly
+     while read -r sum file; do
+         [ "$(md5 -q "$file")" = "$sum" ] && echo "$file: OK" || echo "$file: FAILED"
+     done < checklist.chk
+ else
+     md5sum -c checklist.chk
+ fi
+ ```
+
+ ## Model License
+
+ The Pangu Pro MoE model is licensed under the Pangu Model License Agreement, which is intended to permit use and enable the further development of artificial intelligence technologies. Please refer to the `LICENSE` file located in the root directory of the model repository for details.
+
+ ## Disclaimer
+
+ Due to the technical limitations inherent in the technology on which Pangu Pro MoE (the "Model") relies, and the fact that the artificial-intelligence-generated content is produced automatically by the Model, we cannot make any guarantees regarding the following matters:
+
+ 1. The output of this Model is automatically generated via AI algorithms; it cannot be ruled out that some of the information may be flawed, unreasonable, or cause discomfort, and the generated content does not represent Huawei's attitude or standpoint;
+ 2. There is no guarantee that this Model is 100% accurate, reliable, functional, timely, secure, safe, error-free, uninterrupted, continuously stable, or free of any faults;
+ 3. The output of this Model does not constitute any advice or decision for you, and it does not guarantee the authenticity, completeness, accuracy, timeliness, legality, functionality, or practicality of the generated content. The generated content cannot replace professionals in medical, legal, and other fields in answering your questions. The generated content is for your reference only and does not represent any attitude, standpoint, or position of Huawei. You need to make independent judgments based on your actual situation, and Huawei does not assume any responsibility.
+
+ ## Citation
+
+ If our work is helpful for your research or projects, we would appreciate your citation.
+
+ ```bibtex
+ @article{tang2025pangu,
+   title={Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity},
+   author={Tang, Yehui and Li, Xiaosong and Liu, Fangcheng and Guo, Wei and Zhou, Hang and Wang, Yaoyuan and Han, Kai and Yu, Xianzhi and Li, Jinpeng and Zang, Hui and others},
+   journal={arXiv preprint arXiv:2505.21411},
+   year={2025}
+ }
+ ```
+
+ ## Contact
+
+ If you have any questions, please raise an issue or contact us at [email protected]
checklist.chk ADDED
@@ -0,0 +1,38 @@
+ 829f4f25c066d17f9de2ffeec699b11a config.json
+ f290f3c6ccbbd635cc14f7e76099ba9b configuration_pangu_moe.py
+ 5530c86e9de9c6eb6297bfe947517093 generation_config.json
+ e06a817abd24e832b9f1de3517cb138a model-00001-of-00029.safetensors
+ 3d513f2de7d36c079132709afbc6262f model-00002-of-00029.safetensors
+ 8a14cbb35ab071dad02b486cdc2fcf8c model-00003-of-00029.safetensors
+ 7ca7715fd337e1b96c650a8004b20efb model-00004-of-00029.safetensors
+ fe93ef618011e5773f5488f29d951196 model-00005-of-00029.safetensors
+ 349af35e6c5a62ce99ce4e28f93d3769 model-00006-of-00029.safetensors
+ ef8bdc390117bd5abf808bbbec937aa6 model-00007-of-00029.safetensors
+ a947b13ca6d29c8eb0b35edd4b8040f7 model-00008-of-00029.safetensors
+ 38f75ca1322a99825688203c4c7bdbc3 model-00009-of-00029.safetensors
+ 98347b24892c854d524c1a210f7b0225 model-00010-of-00029.safetensors
+ 23c76ee5caed0bed1298bb7b72be04e9 model-00011-of-00029.safetensors
+ 2000a3690bcc7e291d46c4b5c3fd4f63 model-00012-of-00029.safetensors
+ c97f087edfd685edce5ec1baadfe2ec2 model-00013-of-00029.safetensors
+ 81521d10a99d3f89d8ecfa0ce490533e model-00014-of-00029.safetensors
+ 2ef05292e2f95bacef0f46caffb6c996 model-00015-of-00029.safetensors
+ 2a3f50bc3d88c06a62c96a0e43bcb5f2 model-00016-of-00029.safetensors
+ eec6364b2bb34c09d191523cc75996be model-00017-of-00029.safetensors
+ acd61e24dc8fa8e82c98e85ee742bd9b model-00018-of-00029.safetensors
+ 827618d8295435aeaf55bd67380a5087 model-00019-of-00029.safetensors
+ d44869a0b3f53d3173ab02393988d185 model-00020-of-00029.safetensors
+ ea0386426fd55cf30a3ab43ccf0d41d8 model-00021-of-00029.safetensors
+ 767977eaaa0a0c8898cd9e64eeacdb57 model-00022-of-00029.safetensors
+ 7896a81ef2aae2b20598b991db0d364f model-00023-of-00029.safetensors
+ 304875e4e17a53e3e5625971dbb6c707 model-00024-of-00029.safetensors
+ 3a6e14ade4691a0a59a97500edc16ee9 model-00025-of-00029.safetensors
+ 337cdfb024742835baf73194f5b13e20 model-00026-of-00029.safetensors
+ 312ee84ac27c6b349f252fafb8a603c3 model-00027-of-00029.safetensors
+ 57aa6d77b4a1fea2c79ca7d762a2103a model-00028-of-00029.safetensors
+ 6282ccbb1014d1dae9843a91c3c854bb model-00029-of-00029.safetensors
+ f704d10889b7ce5c3abb01b89fe7a429 modeling_pangu_moe.py
+ 6813f7e33cf845413cbcf432836d2cb6 model.safetensors.index.json
+ 3296eaa8d86a025b3357155b4220b8f2 special_tokens_map.json
+ 802ac2995192a488eb0997c4dfcc70b0 tokenization_pangu_moe.py
+ 0a60ccca2283b2dc3e2fa91d01501de1 tokenizer_config.json
+ dcdad36664804ecfce35aeb7d27dc65f tokenizer.model
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "architectures": [
+     "PanguProMoEForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_pangu_moe.PanguProMoEConfig",
+     "AutoModel": "modeling_pangu_moe.PanguProMoEModel",
+     "AutoModelForCausalLM": "modeling_pangu_moe.PanguProMoEForCausalLM"
+   },
+   "bos_token_id": 1,
+   "eos_token_id": 45892,
+   "hidden_act": "silu",
+   "hidden_size": 5120,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 131072,
+   "model_type": "PanguProMoE",
+   "moe_intermediate_size": 1344,
+   "num_attention_heads": 40,
+   "num_experts": 64,
+   "num_experts_per_tok": 8,
+   "num_hidden_layers": 48,
+   "num_key_value_heads": 8,
+   "output_router_logits": false,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 16000000.0,
+   "router_aux_loss_coef": 0.001,
+   "shared_expert_intermediate_size": 5376,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.48.2",
+   "use_cache": true,
+   "vocab_size": 153376
+ }
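As a quick sanity check (not part of the repository), a back-of-envelope count from these fields lands near the advertised 72B total / 16B activated parameters. The head dimension is assumed to be `hidden_size / num_attention_heads`, and norms, router gates, and other small tensors are ignored:

```python
# rough parameter count from config.json (approximate)
h, L, V = 5120, 48, 153376               # hidden_size, num_hidden_layers, vocab_size
heads, kv_heads, d = 40, 8, 5120 // 40   # head dim assumed to be 128
expert = 3 * h * 1344                    # gate/up/down of one routed expert
shared = 3 * h * 5376                    # shared expert
attn = h * heads * d + 2 * h * kv_heads * d + heads * d * h  # q, k, v, o projections
embed = 2 * V * h                        # untied input and output embeddings

total = L * (attn + 64 * expert + shared) + embed
active = L * (attn + 8 * expert + shared) + embed  # 8 routed experts per token
print(f"total  ~{total / 1e9:.1f}B")     # ~72.0B
print(f"active ~{active / 1e9:.1f}B")    # ~16.5B
```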
configuration_pangu_moe.py ADDED
@@ -0,0 +1,76 @@
+ # coding=utf-8
+ # Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ PanguProMoE model configuration"""
+
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class PanguProMoEConfig(PretrainedConfig):
+
+     model_type = "PanguProMoE"
+     _auto_class = "AutoConfig"
+
+     def __init__(
+         self,
+         vocab_size=153376,
+         hidden_size=5120,
+         num_hidden_layers=48,
+         num_attention_heads=40,
+         num_key_value_heads=8,
+         hidden_act="silu",
+         max_position_embeddings=131072,
+         initializer_range=0.02,
+         rms_norm_eps=1e-5,
+         use_cache=True,
+         tie_word_embeddings=False,
+         rope_theta=16000000.0,
+         moe_intermediate_size=1344,
+         shared_expert_intermediate_size=5376,
+         num_experts_per_tok=8,
+         num_experts=64,
+         output_router_logits=False,
+         router_aux_loss_coef=0.001,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+
+         # MoE arguments
+         self.moe_intermediate_size = moe_intermediate_size
+         self.shared_expert_intermediate_size = shared_expert_intermediate_size
+         self.num_experts_per_tok = num_experts_per_tok
+         self.num_experts = num_experts
+         self.output_router_logits = output_router_logits
+         self.router_aux_loss_coef = router_aux_loss_coef
+
+         super().__init__(
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
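A minimal usage sketch (not part of the repository; run from inside the downloaded folder, and the local path is a placeholder): the defaults above mirror `config.json`, and the class resolves through the `auto_map` entry when loading with `trust_remote_code=True`.

```python
from transformers import AutoConfig

from configuration_pangu_moe import PanguProMoEConfig

cfg = PanguProMoEConfig()  # defaults mirror config.json
assert cfg.num_experts == 64 and cfg.num_experts_per_tok == 8

# or resolve via auto_map from the downloaded repository folder
cfg = AutoConfig.from_pretrained("path_to_Pangu_Pro_MoE", trust_remote_code=True)
print(cfg.model_type)  # PanguProMoE
```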
generation_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "do_sample": true,
+   "eos_token_id": [
+     45892
+   ],
+   "pad_token_id": 0,
+   "temperature": 0.6,
+   "top_k": 50,
+   "top_p": 0.95,
+   "transformers_version": "4.48.2"
+ }
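For reference, `model.generate()` picks these sampling defaults up automatically when the repository folder is passed to `from_pretrained`; a minimal inspection sketch (placeholder path):

```python
from transformers import GenerationConfig

gen = GenerationConfig.from_pretrained("path_to_Pangu_Pro_MoE")
print(gen.do_sample, gen.temperature, gen.top_k, gen.top_p)  # True 0.6 50 0.95
print(gen.eos_token_id)  # [45892], the token id passed as eos in the README example
```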
model-00001-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95c716672541890f5c5022270a54990126a4e6672589c413dcd1e22fe5370faa
+ size 4989027928
model-00002-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:07f49b3beb42046fb0923af9ad9520f0a9739478ba51ba33f2d81e64c152fc27
+ size 4998519992
model-00003-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c49c6587c5baa652facbb731a3253a5f8bc411fc799f397ab83821b4070c3df
+ size 4987423040
model-00004-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c38c97fe9d33fbc87d4285172c82a36b4afe738b7c2e38c005a3517624f4843d
+ size 4987423024
model-00005-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cf3a82b81a428b80470875b82520b013c04da1d49fd548379e80dab7368a704f
+ size 4970994256
model-00006-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb5ea7ee959cda3da5b1a91c97eff0120a29fcbaa3063070df8cdb78d7468f7c
+ size 4987423416
model-00007-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c6deda3d3e089543d46c6a7f325d4a6d465512799bd82f30236404e958110529
+ size 4987423312
model-00008-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2146dea73c97dc76fec25b5187816a73fa04286e6e954119cf88aa8192a96bc1
+ size 4987423408
model-00009-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc2ceffc167099e3f7fd57d7d90fd85fc361c43185abaed066c2114ea2eeaaaa
+ size 4998520328
model-00010-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5511b5e9f4e76a78eed5a7b73affedc552a7a7cf5db6081c506741d8399ccd7f
+ size 4987423392
model-00011-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84b52b243e7e0369beddebef1ac7abe808a7beb0dc22bf2e3be3110ea2628557
+ size 4987423384
model-00012-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f371a1f71d115b4b0dbcc8a31646ad25a49069d31b4fe9e38f4415ed3c782623
+ size 4998520352
model-00013-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9e13d22c1aefde04d032e200e9e458e9c29e70c34d393987d6f0e935fdbfb39
+ size 4987423392
model-00014-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21a68d0437cf72e8bb1bb828ec2aa90fa0f3f76d02b924a91768e55d26d3427c
+ size 4987423376
model-00015-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:238d040f730ba1880a1348f7aa53dfd9ba458f96c828f2237a1c263ab14a822e
+ size 4970994608
model-00016-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d5eb1ce1c1b280ca1de650c35531d18892bbb6d4567635955ca2741908b9224d
+ size 4987423768
model-00017-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9f7c4ed8b56cf936a529769fedb10ff6520beaa27ddf8441ee6b889e88ceb2a
+ size 4987423392
model-00018-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60bea02af65c206ef49458a8f9de5722d52d0795fe35d3fad915e1bbf8cba9f0
+ size 4987423408
model-00019-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f63cc4c14f29c0cca4d97d38085281de8487c2461a6e443f57037395a5e99c6
+ size 4998520328
model-00020-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08d808cadb7d9621c09c6a9844c069ca7657981700b91397076ff5ae18cfc4ac
+ size 4987423392
model-00021-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d1522cfd7d6703a3b7ee9596d9bc0ae7e4ab131e0952ecbf12a3553dfe2e289
+ size 4987423384
model-00022-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e1e0931fde6b95159195448ba1c78d4c571ef23c0061304ec0b8427fa86e8e27
+ size 4998520352
model-00023-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ab6c0cc37a2174e6badc32681c0a41f64c45c10882a4ef546ca1d8c52d97e20d
+ size 4987423392
model-00024-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bc63ade60fe2eb1da438cb4b1c9d79e9bc9554c596fa518e8365630996c01409
+ size 4987423376
model-00025-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4f78276da8b9bbb83f0e369eed9cc73c9fbc7361bc5947ba8f02500bf69f16a2
+ size 4970994608
model-00026-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd71fc3871dfa8e2f0bb70960ada6dc6614778b7afc71d5e9622c9acbacb168c
+ size 4987423768
model-00027-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:107a2a23aa269f9b3c98999f80d151f59e8555235a4f225c0acf7b1978f6fa8f
+ size 4987423392
model-00028-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1495a49d1f23cb5080e9724cc7e691823e54a164aaeb12fc3e5c7582d2042adb
+ size 4987423408
model-00029-of-00029.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:895c0bfe50125b81e3d0d0e04463881da7c21407679f09a83866a9aef070ac45
+ size 4323137336
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_pangu_moe.py ADDED
@@ -0,0 +1,1004 @@
+ # coding=utf-8
+ # Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ PyTorch PanguProMoE model."""
+ import math
+ from typing import List, Optional, Tuple, Union
+
+ import sys
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+ from transformers.modeling_outputs import (
+     MoeCausalLMOutputWithPast,
+     MoeModelOutputWithPast,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import (
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     logging,
+     replace_return_docstrings,
+ )
+
+ from .configuration_pangu_moe import PanguProMoEConfig
+
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "PanguProMoEConfig"
+
+
+ class NotSupportedError(Exception):
+     def __str__(self):
+         return "NotSupportedError"
+
+ def check_config(top_k, num_experts):
+     if top_k == 8 and num_experts == 64:
+         return
+     raise NotSupportedError()
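+
+ # top_k=8 with num_experts=64 matches the released MoGE layout: 64 routed
+ # experts in 8 groups with 1 expert activated per group, i.e. exactly 8
+ # routed experts per token; other combinations are rejected by this build.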
+
+
+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position
+ def _prepare_4d_causal_attention_mask_with_cache_position(
+     attention_mask: torch.Tensor,
+     sequence_length: int,
+     target_length: int,
+     dtype: torch.dtype,
+     device: torch.device,
+     min_dtype: float,
+     cache_position: torch.Tensor,
+     batch_size: int,
+ ):
+     if attention_mask is not None and attention_mask.dim() == 4:
+         # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+         causal_mask = attention_mask
+     else:
+         causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
+         if sequence_length != 1:
+             causal_mask = torch.triu(causal_mask, diagonal=1)
+         causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
+         causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+         if attention_mask is not None:
+             causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+             mask_length = attention_mask.shape[-1]
+             padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
+             padding_mask = padding_mask == 0
+             causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                 padding_mask, min_dtype
+             )
+
+     return causal_mask
+
+
+ # Copied from transformers.models.mixtral.modeling_mixtral.load_balancing_loss_func
+ def load_balancing_loss_func(
+     gate_logits: torch.Tensor, num_experts: int = None, top_k=2, attention_mask: Optional[torch.Tensor] = None
+ ) -> float:
+     r"""
+     Computes the auxiliary load balancing loss as in Switch Transformer - implemented in PyTorch.
+
+     See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
+     function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
+     experts is too unbalanced.
+
+     Args:
+         gate_logits (Union[`torch.Tensor`, Tuple[torch.Tensor]]):
+             Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
+             shape [batch_size X sequence_length, num_experts].
+         attention_mask (`torch.Tensor`, *optional*):
+             The attention_mask used in the forward function,
+             of shape [batch_size X sequence_length] if not None.
+         num_experts (`int`, *optional*):
+             Number of experts.
+
+     Returns:
+         The auxiliary loss.
+     """
+     if gate_logits is None or not isinstance(gate_logits, tuple):
+         return 0
+
+     if isinstance(gate_logits, tuple):
+         compute_device = gate_logits[0].device
+         concatenated_gate_logits = torch.cat([layer_gate.to(compute_device) for layer_gate in gate_logits], dim=0)
+
+     routing_weights = torch.nn.functional.softmax(concatenated_gate_logits, dim=-1)
+
+     _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
+
+     expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
+
+     if attention_mask is None:
+         # Compute the percentage of tokens routed to each expert
+         tokens_per_expert = torch.mean(expert_mask.float(), dim=0)
+
+         # Compute the average probability of routing to these experts
+         router_prob_per_expert = torch.mean(routing_weights, dim=0)
+     else:
+         batch_size, sequence_length = attention_mask.shape
+         num_hidden_layers = concatenated_gate_logits.shape[0] // (batch_size * sequence_length)
+
+         # Compute the mask that masks all padding tokens as 0, with the same shape as expert_mask
+         expert_attention_mask = (
+             attention_mask[None, :, :, None, None]
+             .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
+             .reshape(-1, top_k, num_experts)
+             .to(compute_device)
+         )
+
+         # Compute the percentage of tokens routed to each expert
+         tokens_per_expert = torch.sum(expert_mask.float() * expert_attention_mask, dim=0) / torch.sum(
+             expert_attention_mask, dim=0
+         )
+
+         # Compute the mask that masks all padding tokens as 0, with the same shape as tokens_per_expert
+         router_per_expert_attention_mask = (
+             attention_mask[None, :, :, None]
+             .expand((num_hidden_layers, batch_size, sequence_length, num_experts))
+             .reshape(-1, num_experts)
+             .to(compute_device)
+         )
+
+         # Compute the average probability of routing to these experts
+         router_prob_per_expert = torch.sum(routing_weights * router_per_expert_attention_mask, dim=0) / torch.sum(
+             router_per_expert_attention_mask, dim=0
+         )
+
+     overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
+     return overall_loss * num_experts
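+
+ # Example shapes: for this 48-layer model, `gate_logits` is a 48-tuple of
+ # [batch*seq, 64] router logits with top_k=8; the loss is minimized when
+ # routed tokens are spread uniformly (1/64 of the tokens per expert).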
172
+
173
+
174
+ class PanguProMoERMSNorm(nn.Module):
175
+ def __init__(self, hidden_size, eps=1e-5):
176
+ super().__init__()
177
+ self.weight = nn.Parameter(torch.ones(hidden_size))
178
+ self.variance_epsilon = eps
179
+
180
+ def forward(self, hidden_states):
181
+ input_dtype = hidden_states.dtype
182
+ hidden_states = hidden_states.to(torch.float32)
183
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
184
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
185
+ return self.weight * hidden_states.to(input_dtype)
186
+
187
+ def extra_repr(self):
188
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
189
+
190
+ class PanguProMoERotaryEmbedding(nn.Module):
+     def __init__(self, dim, max_position_embeddings=131072, base=16000000.0, device=None):
+         super().__init__()
+ 
+         self.dim = dim
+         self.max_position_embeddings = max_position_embeddings
+         self.base = base
+         inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+ 
+         # Build here to make `torch.jit.trace` work.
+         self._set_cos_sin_cache(
+             seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+         )
+ 
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+ 
+         freqs = torch.outer(t, self.inv_freq)
+         # Different from the paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+ 
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         if seq_len > self.max_seq_len_cached:
+             self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+ 
+         return (
+             self.cos_cached[:seq_len].to(dtype=x.dtype),
+             self.sin_cached[:seq_len].to(dtype=x.dtype),
+         )
+ 
+ 
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+ 
+ 
+ # Copied from transformers.models.mixtral.modeling_mixtral.apply_rotary_pos_emb
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
+     cos = cos[position_ids].unsqueeze(unsqueeze_dim)
+     sin = sin[position_ids].unsqueeze(unsqueeze_dim)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+ 
+ 
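Taken together, the rotary embedding, `rotate_half`, and `apply_rotary_pos_emb` rotate query/key vectors by position-dependent angles before attention. A minimal shape sketch (the dimensions are made up for illustration, not the released config):

```python
# Hedged sketch: dimensions are illustrative, not the model's actual config values.
import torch

bsz, heads, seq, head_dim = 1, 2, 5, 8
q = torch.randn(bsz, heads, seq, head_dim)
k = torch.randn(bsz, heads, seq, head_dim)
rope = PanguProMoERotaryEmbedding(dim=head_dim)
cos, sin = rope(q, seq_len=seq)                # each: (seq, head_dim)
position_ids = torch.arange(seq).unsqueeze(0)  # (1, seq)
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin, position_ids)
assert q_rot.shape == q.shape and k_rot.shape == k.shape
```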
+ # Modified from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->PanguProMoE
+ class PanguProMoEMLP(nn.Module):
+     def __init__(self, config, intermediate_size=None):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+         self.act_fn = ACT2FN[config.hidden_act]
+ 
+     def forward(self, x):
+         # Gated MLP (SwiGLU-style): down_proj(act(gate_proj(x)) * up_proj(x))
+         return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+ 
+ 
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """
+     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+     """
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+ 
+ 
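`repeat_kv` implements the broadcast used for grouped-query attention: each of the `num_key_value_heads` K/V heads is shared by `n_rep` query heads. For example (illustrative shapes):

```python
# Illustrative GQA shape check for repeat_kv.
import torch

kv = torch.randn(1, 4, 10, 64)   # (batch, num_key_value_heads=4, seqlen, head_dim)
out = repeat_kv(kv, n_rep=8)     # each KV head serves 8 query heads
assert out.shape == (1, 32, 10, 64)  # (batch, num_attention_heads=32, seqlen, head_dim)
```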
+ class PanguProMoEAttention(nn.Module):
+     """
+     Multi-headed attention from the 'Attention Is All You Need' paper.
+     """
+     def __init__(self, config: PanguProMoEConfig, layer_idx: Optional[int] = None):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+         if layer_idx is None:
+             logger.warning_once(
+                 f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
+                 "lead to errors during the forward call, if caching is used. Please make sure to provide a "
+                 "`layer_idx` when creating this class."
+             )
+ 
+         self.hidden_size = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.head_dim = self.hidden_size // self.num_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+         self.max_position_embeddings = config.max_position_embeddings
+         self.rope_theta = config.rope_theta
+         self.is_causal = True
+         self.attention_dropout = config.attention_dropout
+ 
+         if (self.head_dim * self.num_heads) != self.hidden_size:
+             raise ValueError(
+                 f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                 f" and `num_heads`: {self.num_heads})."
+             )
+         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
+         self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
+         self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
+         self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=True)
+ 
+         self.rotary_emb = PanguProMoERotaryEmbedding(
+             self.head_dim,
+             max_position_embeddings=self.max_position_embeddings,
+             base=self.rope_theta,
+         )
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+         cache_position: Optional[torch.LongTensor] = None,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         bsz, q_len, _ = hidden_states.size()
+ 
+         query_states = self.q_proj(hidden_states)
+         key_states = self.k_proj(hidden_states)
+         value_states = self.v_proj(hidden_states)
+ 
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+ 
+         kv_seq_len = key_states.shape[-2]
+         if past_key_value is not None:
+             if self.layer_idx is None:
+                 raise ValueError(
+                     f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+                     "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+                     "with a layer index."
+                 )
+             kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+         cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+ 
+         if past_key_value is not None:
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+ 
+         # repeat k/v heads if n_kv_heads < n_heads
+         key_states = repeat_kv(key_states, self.num_key_value_groups)
+         value_states = repeat_kv(value_states, self.num_key_value_groups)
+ 
+         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+ 
+         if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+             raise ValueError(
+                 f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                 f" {attn_weights.size()}"
+             )
+ 
+         if attention_mask is not None:  # no matter the length, we just slice it
+             causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+             attn_weights = attn_weights + causal_mask
+ 
+         # upcast attention to fp32
+         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+         attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
+         attn_output = torch.matmul(attn_weights, value_states)
+ 
+         if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+             raise ValueError(
+                 f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                 f" {attn_output.size()}"
+             )
+ 
+         attn_output = attn_output.transpose(1, 2).contiguous()
+         attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+ 
+         attn_output = self.o_proj(attn_output)
+ 
+         if not output_attentions:
+             attn_weights = None
+ 
+         return attn_output, attn_weights, past_key_value
+ 
+ 
+ class PanguProMoESparseMoeBlock(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.num_experts = config.num_experts
+         self.top_k = config.num_experts_per_tok
+ 
+         # for Pangu Pro MoE: experts are partitioned into groups, one expert selected per group
+         check_config(self.top_k, self.num_experts)
+         self.num_groups = 8
+         self.experts_per_group = self.num_experts // self.num_groups
+ 
+         # gating
+         self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
+         self.experts = nn.ModuleList(
+             [PanguProMoEMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
+         )
+         self.shared_expert = PanguProMoEMLP(config, intermediate_size=config.shared_expert_intermediate_size)
+         self.router_scale = torch.nn.Parameter(torch.ones((1, self.num_experts)))
+ 
+     def forward(self, hidden_states: torch.Tensor, layer_number: int) -> Tuple[torch.Tensor, torch.Tensor]:
+         batch_size, sequence_length, hidden_dim = hidden_states.shape
+         hidden_states = hidden_states.view(-1, hidden_dim)
+         router_logits = self.gate(hidden_states)
+         routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
+ 
+         # Grouped top-1 routing: pick the highest-probability expert within each group,
+         # then shift the group-local indices to global expert ids
+         routing_weights, selected_experts = torch.max(
+             routing_weights.view(routing_weights.shape[0], self.num_groups, -1), dim=-1
+         )
+         bias = torch.arange(0, self.num_experts, self.experts_per_group, device=routing_weights.device, dtype=torch.int64).unsqueeze(0)
+         selected_experts = selected_experts + bias
+ 
+         # we cast back to the input dtype
+         routing_weights = routing_weights.to(hidden_states.dtype)
+ 
+         final_hidden_states = torch.zeros(
+             (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
+         )
+ 
+         # One hot encode the selected experts to create an expert mask
+         # this will be used to easily index which expert is going to be solicited
+         expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
+ 
+         # Loop over all available experts in the model and perform the computation on each expert
+         for expert_idx in range(self.num_experts):
+             expert_layer = self.experts[expert_idx]
+             idx, top_x = torch.where(expert_mask[expert_idx])
+ 
+             # Index the correct hidden states and compute the expert hidden state for
+             # the current expert. We need to make sure to multiply the output hidden
+             # states by `routing_weights` on the corresponding tokens (top-1 within each group)
+             current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
+             current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] * self.router_scale[:, expert_idx][0]
+ 
+             # However `index_add_` only supports torch tensors for indexing so we'll use
+             # the `top_x` tensor here.
+             final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
+ 
+         shared_expert_output = self.shared_expert(hidden_states)
+         final_hidden_states = final_hidden_states + shared_expert_output
+ 
+         final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
+         return final_hidden_states, router_logits
+ 
+ 
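Unlike standard top-k routing over all experts at once, this block partitions the experts into `num_groups = 8` equally sized groups and takes the single best expert within each group, so every token activates exactly one expert per group (plus the always-on shared expert). A standalone sketch of the index arithmetic; the expert count here is illustrative, not the released config:

```python
# Hedged sketch of the grouped top-1 routing above; 16 experts / 8 groups is illustrative.
import torch
import torch.nn.functional as F

tokens, num_experts, num_groups = 3, 16, 8
experts_per_group = num_experts // num_groups        # 2 experts per group
weights = F.softmax(torch.randn(tokens, num_experts), dim=1)

# argmax within each group: indices are local to the group (0..experts_per_group-1)
weights, local_idx = torch.max(weights.view(tokens, num_groups, -1), dim=-1)

# shift local indices to global expert ids by adding each group's starting offset
offsets = torch.arange(0, num_experts, experts_per_group).unsqueeze(0)  # [[0, 2, 4, ...]]
global_idx = local_idx + offsets                      # (tokens, num_groups)
assert global_idx.max() < num_experts
```

Because every token draws exactly one expert from each group, the per-group expert load is balanced by construction, which appears to be the constraint `check_config` (defined earlier in the file) verifies.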
+ class PanguProMoEDecoderLayer(nn.Module):
+     def __init__(self, config: PanguProMoEConfig, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.layer_idx = layer_idx
+         self.self_attn = PanguProMoEAttention(config, layer_idx)
+         self.mlp = PanguProMoESparseMoeBlock(config)
+         self.input_layernorm = PanguProMoERMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = PanguProMoERMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: Optional[bool] = False,
+         output_router_logits: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         """
+         Args:
+             hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+             attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+                 `(batch, sequence_length)` where padding elements are indicated by 0.
+             output_attentions (`bool`, *optional*):
+                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                 returned tensors for more detail.
+             output_router_logits (`bool`, *optional*):
+                 Whether or not to return the logits of all the routers. They are useful for computing the router loss,
+                 and should not be returned during inference.
+             use_cache (`bool`, *optional*):
+                 If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                 (see `past_key_values`).
+             past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+             cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+                 Indices depicting the position of the input sequence tokens in the sequence.
+             kwargs (`dict`, *optional*):
+                 Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
+                 into the model
+         """
+         residual = hidden_states
+ 
+         hidden_states = self.input_layernorm(hidden_states)
+ 
+         # Self Attention
+         hidden_states, self_attn_weights, present_key_value = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+         )
+         hidden_states = residual + hidden_states
+ 
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+ 
+         hidden_states = self.mlp(hidden_states, self.layer_idx)
+         if isinstance(hidden_states, tuple):
+             hidden_states, router_logits = hidden_states
+         else:
+             router_logits = None
+ 
+         hidden_states = residual + hidden_states
+ 
+         outputs = (hidden_states,)
+ 
+         if output_attentions:
+             outputs += (self_attn_weights,)
+ 
+         if use_cache:
+             outputs += (present_key_value,)
+ 
+         if output_router_logits:
+             outputs += (router_logits,)
+ 
+         return outputs
+ 
+ 
+ class PanguProMoEPreTrainedModel(PreTrainedModel):
+     config_class = PanguProMoEConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["PanguProMoEDecoderLayer"]
+     _skip_keys_device_placement = "past_key_values"
+     _supports_cache_class = True
+ 
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+ 
+ 
+ class PanguProMoEModel(PanguProMoEPreTrainedModel):
+     """
+     Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`PanguProMoEDecoderLayer`]
+ 
+     Args:
+         config: PanguProMoEConfig
+     """
+ 
+     def __init__(self, config: PanguProMoEConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+ 
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [PanguProMoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self._attn_implementation = config._attn_implementation
+         self.norm = PanguProMoERMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+ 
+         self.gradient_checkpointing = False
+         # Initialize weights and apply final processing
+         self.post_init()
+ 
+     def get_input_embeddings(self):
+         return self.embed_tokens
+ 
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+ 
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         output_router_logits: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+     ) -> Union[Tuple, MoeModelOutputWithPast]:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_router_logits = (
+             output_router_logits if output_router_logits is not None else self.config.output_router_logits
+         )
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+ 
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ 
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError(
+                 "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+             )
+ 
+         if self.gradient_checkpointing and self.training:
+             if use_cache:
+                 logger.warning_once(
+                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                 )
+                 use_cache = False
+ 
+         use_legacy_cache = False
+         if use_cache and not isinstance(past_key_values, Cache) and not self.training:
+             use_legacy_cache = True
+             past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+             logger.warning_once(
+                 "We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. "
+                 "Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)"
+             )
+ 
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+ 
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+ 
+         causal_mask = self._update_causal_mask(
+             attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+         )
+ 
+         hidden_states = inputs_embeds
+ 
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+         all_router_logits = () if output_router_logits else None
+         next_decoder_cache = None
+ 
+         for decoder_layer in self.layers:
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+ 
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = self._gradient_checkpointing_func(
+                     decoder_layer.__call__,
+                     hidden_states,
+                     causal_mask,
+                     position_ids,
+                     past_key_values,
+                     output_attentions,
+                     output_router_logits,
+                     use_cache,
+                     cache_position,
+                 )
+             else:
+                 layer_outputs = decoder_layer(
+                     hidden_states,
+                     attention_mask=causal_mask,
+                     position_ids=position_ids,
+                     past_key_value=past_key_values,
+                     output_attentions=output_attentions,
+                     output_router_logits=output_router_logits,
+                     use_cache=use_cache,
+                     cache_position=cache_position,
+                 )
+ 
+             hidden_states = layer_outputs[0]
+ 
+             if use_cache:
+                 next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+ 
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+ 
+             if output_router_logits and layer_outputs[-1] is not None:
+                 all_router_logits += (layer_outputs[-1],)
+ 
+         hidden_states = self.norm(hidden_states)
+ 
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+ 
+         next_cache = None
+         if use_cache:
+             next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
+ 
+         if not return_dict:
+             return tuple(
+                 v
+                 for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_router_logits]
+                 if v is not None
+             )
+         return MoeModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=next_cache,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+             router_logits=all_router_logits,
+         )
+ 
+     # Copied from transformers.models.llama.modeling_llama.LlamaModel._update_causal_mask
+     def _update_causal_mask(
+         self,
+         attention_mask: torch.Tensor,
+         input_tensor: torch.Tensor,
+         cache_position: torch.Tensor,
+         past_key_values: Cache,
+         output_attentions: bool,
+     ):
+         # TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
+         # KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at each decode step due to the dynamic shapes
+         # (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
+         # `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
+ 
+         if self.config._attn_implementation == "flash_attention_2":
+             if attention_mask is not None and 0.0 in attention_mask:
+                 return attention_mask
+             return None
+ 
+         # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+         # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+         # to infer the attention mask.
+         past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+         using_static_cache = isinstance(past_key_values, StaticCache)
+ 
+         # When output_attentions is True, the sdpa implementation's forward method calls the eager implementation's forward
+         if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
+             if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                 attention_mask,
+                 inputs_embeds=input_tensor,
+                 past_key_values_length=past_seen_tokens,
+                 is_training=self.training,
+             ):
+                 return None
+ 
+         dtype, device = input_tensor.dtype, input_tensor.device
+         min_dtype = torch.finfo(dtype).min
+         sequence_length = input_tensor.shape[1]
+         if using_static_cache:
+             target_length = past_key_values.get_max_length()
+         else:
+             target_length = (
+                 attention_mask.shape[-1]
+                 if isinstance(attention_mask, torch.Tensor)
+                 else past_seen_tokens + sequence_length + 1
+             )
+ 
+         # In case the provided `attention_mask` is 2D, we generate a causal mask here (4D).
+         causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
+             attention_mask,
+             sequence_length=sequence_length,
+             target_length=target_length,
+             dtype=dtype,
+             device=device,
+             min_dtype=min_dtype,
+             cache_position=cache_position,
+             batch_size=input_tensor.shape[0],
+         )
+ 
+         if (
+             self.config._attn_implementation == "sdpa"
+             and attention_mask is not None
+             and attention_mask.device.type == "cuda"
+             and not output_attentions
+         ):
+             # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+             # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+             # Details: https://github.com/pytorch/pytorch/issues/110213
+             causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+ 
+         return causal_mask
+ 
+ 
+ class PanguProMoEForCausalLM(PanguProMoEPreTrainedModel):
+     _tied_weights_keys = ["lm_head.weight"]
+ 
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = PanguProMoEModel(config)
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+ 
+         self.router_aux_loss_coef = config.router_aux_loss_coef
+         self.num_experts = config.num_experts
+         self.num_experts_per_tok = config.num_experts_per_tok
+         # Initialize weights and apply final processing
+         self.post_init()
+ 
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+ 
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+ 
+     def get_output_embeddings(self):
+         return self.lm_head
+ 
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+ 
+     def set_decoder(self, decoder):
+         self.model = decoder
+ 
+     def get_decoder(self):
+         return self.model
+ 
+     @replace_return_docstrings(output_type=MoeCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         output_router_logits: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+     ) -> Union[Tuple, MoeCausalLMOutputWithPast]:
+         r"""
+         Args:
+             labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                 Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+                 config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                 (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+ 
+         Returns:
+ 
+         Example:
+ 
+         ```python
+         >>> from transformers import AutoTokenizer, PanguProMoEForCausalLM
+ 
+         >>> model = PanguProMoEForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+         >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
+ 
+         >>> prompt = "Hey, are you conscious? Can you talk to me?"
+         >>> inputs = tokenizer(prompt, return_tensors="pt")
+ 
+         >>> # Generate
+         >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+         >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+         "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+         ```"""
+ 
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_router_logits = (
+             output_router_logits if output_router_logits is not None else self.config.output_router_logits
+         )
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ 
+         # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+         outputs = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             output_router_logits=output_router_logits,
+             return_dict=return_dict,
+             cache_position=cache_position,
+         )
+ 
+         hidden_states = outputs[0]
+         logits = self.lm_head(hidden_states)
+         logits = logits.float()
+ 
+         loss = None
+         if labels is not None:
+             # Shift so that tokens < n predict n
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             # Flatten the tokens
+             loss_fct = CrossEntropyLoss()
+             shift_logits = shift_logits.view(-1, self.config.vocab_size)
+             shift_labels = shift_labels.view(-1)
+             # Enable model parallelism
+             shift_labels = shift_labels.to(shift_logits.device)
+             loss = loss_fct(shift_logits, shift_labels)
+ 
+         aux_loss = None
+         if output_router_logits:
+             aux_loss = load_balancing_loss_func(
+                 outputs.router_logits if return_dict else outputs[-1],
+                 self.num_experts,
+                 self.num_experts_per_tok,
+                 attention_mask,
+             )
+             if labels is not None:
+                 loss += self.router_aux_loss_coef * aux_loss.to(loss.device)  # make sure to reside on the same device
+ 
+         if not return_dict:
+             output = (logits,) + outputs[1:]
+             if output_router_logits:
+                 output = (aux_loss,) + output
+             return (loss,) + output if loss is not None else output
+ 
+         return MoeCausalLMOutputWithPast(
+             loss=loss,
+             aux_loss=aux_loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+             router_logits=outputs.router_logits,
+         )
+ 
+     # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation
+     def prepare_inputs_for_generation(
+         self,
+         input_ids,
+         past_key_values=None,
+         attention_mask=None,
+         inputs_embeds=None,
+         cache_position=None,
+         position_ids=None,
+         use_cache=True,
+         **kwargs,
+     ):
+         # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
+         # Exception 1: when passing input_embeds, input_ids may be missing entries
+         # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
+         if past_key_values is not None:
+             if inputs_embeds is not None:  # Exception 1
+                 input_ids = input_ids[:, -cache_position.shape[0] :]
+             elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
+                 input_ids = input_ids[:, cache_position]
+ 
+         if attention_mask is not None and position_ids is None:
+             # create position_ids on the fly for batch generation
+             position_ids = attention_mask.long().cumsum(-1) - 1
+             position_ids.masked_fill_(attention_mask == 0, 1)
+             if past_key_values:
+                 position_ids = position_ids[:, -input_ids.shape[1] :]
+ 
+             # This `clone` call is needed to avoid recapturing cuda graphs with `torch.compile`'s `mode="reduce-overhead"`,
+             # as otherwise the input `position_ids` would have varying strides during decoding. Here, simply using
+             # `.contiguous()` is not sufficient, as in the batch size = 1 case `position_ids` is already contiguous but
+             # with varying stride, which retriggers a capture.
+             position_ids = position_ids.clone(memory_format=torch.contiguous_format)
+ 
+         # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+         if inputs_embeds is not None and cache_position[0] == 0:
+             model_inputs = {"inputs_embeds": inputs_embeds, "input_ids": None}
+         else:
+             # The clone here is for the same reason as for `position_ids`.
+             model_inputs = {"input_ids": input_ids.clone(memory_format=torch.contiguous_format), "inputs_embeds": None}
+ 
+         if isinstance(past_key_values, StaticCache) and attention_mask.ndim == 2:
+             if model_inputs["inputs_embeds"] is not None:
+                 batch_size, sequence_length, _ = model_inputs["inputs_embeds"].shape
+                 device = model_inputs["inputs_embeds"].device
+             else:
+                 batch_size, sequence_length = model_inputs["input_ids"].shape
+                 device = model_inputs["input_ids"].device
+ 
+             dtype = self.lm_head.weight.dtype
+             min_dtype = torch.finfo(dtype).min
+ 
+             attention_mask = _prepare_4d_causal_attention_mask_with_cache_position(
+                 attention_mask,
+                 sequence_length=sequence_length,
+                 target_length=past_key_values.get_max_length(),
+                 dtype=dtype,
+                 device=device,
+                 min_dtype=min_dtype,
+                 cache_position=cache_position,
+                 batch_size=batch_size,
+             )
+ 
+         model_inputs.update(
+             {
+                 "position_ids": position_ids,
+                 "cache_position": cache_position,
+                 "past_key_values": past_key_values,
+                 "use_cache": use_cache,
+                 "attention_mask": attention_mask,
+             }
+         )
+         return model_inputs
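Since the model ships as custom code on the Hub, loading it goes through `trust_remote_code`. A hedged usage sketch; the path below is a hypothetical placeholder for this repo's id or a local checkout, and dtype/device handling should be adapted to your hardware:

```python
# Hedged sketch: the path is a hypothetical placeholder, not a confirmed repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/pangu-pro-moe"  # local directory containing these files, or the Hub repo id
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True, torch_dtype="auto")

inputs = tokenizer("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0]))
```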
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "[unused10]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
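Note that the end-of-sequence token is the reserved piece `[unused10]` rather than `</s>`; per `tokenizer_config.json` below, that piece is registered as the added token with id 45892. A quick check (illustrative; assumes the tokenizer has been loaded as `tokenizer`, as in the sketch above):

```python
# Illustrative check against the special-token mapping in this repo's config files.
assert tokenizer.eos_token == "[unused10]"
assert tokenizer.convert_tokens_to_ids("[unused10]") == 45892  # id taken from tokenizer_config.json
```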
tokenization_pangu_moe.py ADDED
@@ -0,0 +1,273 @@
+ # coding=utf-8
+ # Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ 
+ import os
+ from shutil import copyfile
+ from typing import Any, Dict, List, Optional, Tuple
+ 
+ import sentencepiece as spm
+ 
+ from transformers.tokenization_utils import PreTrainedTokenizer
+ from transformers.utils import logging
+ 
+ 
+ logger = logging.get_logger(__name__)
+ 
+ VOCAB_FILES_NAMES = {"vocab_file": "./tokenizer.model"}
+ 
+ PRETRAINED_VOCAB_FILES_MAP = {}
+ 
+ 
+ def convert_bool(string):
+     if isinstance(string, str):
+         if string.lower() == "true":
+             return True
+         elif string.lower() == "false":
+             return False
+         else:
+             return string
+     else:
+         return string
+ 
+ 
+ class PanguProMoETokenizer(PreTrainedTokenizer):
+     """
+     Construct a tokenizer. Based on byte-level Byte-Pair-Encoding.
+ 
+     Args:
+         vocab_file (`str`):
+             Path to the vocabulary file.
+     """
+ 
+     vocab_files_names = VOCAB_FILES_NAMES
+     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+     model_input_names = ["input_ids", "attention_mask"]
+     _auto_class = "AutoTokenizer"
+ 
+     def __init__(
+         self,
+         vocab_file,
+         unk_token="<unk>",
+         bos_token="<s>",
+         eos_token="</s>",
+         pad_token="</s>",
+         sp_model_kwargs: Optional[Dict[str, Any]] = None,
+         add_bos_token=True,
+         add_eos_token=False,
+         decode_with_prefix_space=False,
+         clean_up_tokenization_spaces=False,
+         **kwargs,
+     ):
+         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+         # Load the SentencePiece model before super().__init__, which may query `vocab_size`
+         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+         self.sp_model.Load(vocab_file)
+         super().__init__(
+             bos_token=bos_token,
+             eos_token=eos_token,
+             unk_token=unk_token,
+             pad_token=pad_token,
+             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+             **kwargs,
+         )
+         self.vocab_file = vocab_file
+         self.add_bos_token = convert_bool(add_bos_token)
+         self.add_eos_token = add_eos_token
+         self.decode_with_prefix_space = decode_with_prefix_space
+         self._no_prefix_space_tokens = None
+ 
+     @property
+     def no_prefix_space_tokens(self):
+         if self._no_prefix_space_tokens is None:
+             vocab = self.convert_ids_to_tokens(list(range(self.vocab_size)))
+             self._no_prefix_space_tokens = {i for i, tok in enumerate(vocab) if not tok.startswith("▁")}
+         return self._no_prefix_space_tokens
+ 
+     @property
+     def vocab_size(self):
+         """Returns vocab size"""
+         return self.sp_model.get_piece_size()
+ 
+     @property
+     def bos_token_id(self) -> Optional[int]:
+         return self.sp_model.bos_id()
+ 
+     @property
+     def eos_token_id(self) -> Optional[int]:
+         return super().eos_token_id
+ 
+     def get_vocab(self):
+         """Returns vocab as a dict"""
+         vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+         vocab.update(self.added_tokens_encoder)
+         return vocab
+ 
+     def _tokenize(self, text):
+         """Returns a tokenized string."""
+         return self.sp_model.encode(text, out_type=str)
+ 
+     def _convert_token_to_id(self, token):
+         """Converts a token (str) to an id using the vocab."""
+         return self.sp_model.piece_to_id(token)
+ 
+     def _convert_id_to_token(self, index):
+         """Converts an index (integer) to a token (str) using the vocab."""
+         token = self.sp_model.IdToPiece(index)
+         return token
+ 
+     def _maybe_add_prefix_space(self, tokens, decoded):
+         if tokens and tokens[0] not in self.no_prefix_space_tokens:
+             return " " + decoded
+         else:
+             return decoded
+ 
+     def convert_tokens_to_string(self, tokens):
+         """Converts a sequence of tokens (string) into a single string."""
+         current_sub_tokens = []
+         out_string = ""
+         for token in tokens:
+             # make sure that special tokens are not decoded using the sentencepiece model
+             if token in self.all_special_tokens:
+                 # Decode the current sub-tokens first
+                 if current_sub_tokens:
+                     out_string += self.sp_model.decode(current_sub_tokens)
+                     current_sub_tokens = []
+                 # Append the special token without adding extra spaces
+                 out_string += token
+             else:
+                 current_sub_tokens.append(token)
+         # Decode any remaining sub-tokens
+         if current_sub_tokens:
+             out_string += self.sp_model.decode(current_sub_tokens)
+         # Clean up leading and trailing spaces
+         if self.clean_up_tokenization_spaces:
+             out_string = self.clean_up_tokenization(out_string)
+         out_string = self._maybe_add_prefix_space(tokens=tokens, decoded=out_string)
+         return out_string[1:]
+ 
+     # Override decode to default `spaces_between_special_tokens` to False
+     def decode(self,
+                token_ids,
+                spaces_between_special_tokens: bool = False,
+                **kwargs):
+         return super().decode(
+             token_ids=token_ids,
+             spaces_between_special_tokens=spaces_between_special_tokens,
+             **kwargs,
+         )
+ 
+     def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
+         """
+         Save the vocabulary and special tokens file to a directory.
+ 
+         Args:
+             save_directory (`str`):
+                 The directory in which to save the vocabulary.
+ 
+         Returns:
+             `Tuple(str)`: Paths to the files saved.
+         """
+         if not os.path.isdir(save_directory):
+             logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+             return ("",)
+         out_vocab_file = os.path.join(
+             save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+         )
+ 
+         if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
+             copyfile(self.vocab_file, out_vocab_file)
+         elif not os.path.isfile(self.vocab_file):
+             with open(out_vocab_file, "wb") as fi:
+                 content_spiece_model = self.sp_model.serialized_model_proto()
+                 fi.write(content_spiece_model)
+ 
+         return (out_vocab_file,)
+ 
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         if self.add_bos_token:
+             bos_token_ids = [self.bos_token_id]
+         else:
+             bos_token_ids = []
+ 
+         output = bos_token_ids + token_ids_0
+ 
+         if token_ids_1 is not None:
+             output = output + token_ids_1
+ 
+         if self.add_eos_token:
+             output = output + [self.eos_token_id]
+ 
+         return output
+ 
+     def get_special_tokens_mask(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
+     ) -> List[int]:
+         """
+         Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+         special tokens using the tokenizer `prepare_for_model` method.
+ 
+         Args:
+             token_ids_0 (`List[int]`):
+                 List of IDs.
+             token_ids_1 (`List[int]`, *optional*):
+                 Optional second list of IDs for sequence pairs.
+             already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                 Whether or not the token list is already formatted with special tokens for the model.
+ 
+         Returns:
+             `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+         """
+         if already_has_special_tokens:
+             return super().get_special_tokens_mask(
+                 token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
+             )
+ 
+         if token_ids_1 is None:
+             return [1] + ([0] * len(token_ids_0)) + [1]
+         return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
+ 
+     def create_token_type_ids_from_sequences(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+     ) -> List[int]:
+         """
+         Create a mask from the two sequences passed to be used in a sequence-pair classification task. PanguProMoE
+         does not make use of token type ids, therefore a list of zeros is returned.
+ 
+         Args:
+             token_ids_0 (`List[int]`):
+                 List of IDs.
+             token_ids_1 (`List[int]`, *optional*):
+                 Optional second list of IDs for sequence pairs.
+ 
+         Returns:
+             `List[int]`: List of zeros.
+         """
+         eos = [self.eos_token_id]
+ 
+         if token_ids_1 is None:
+             return len(token_ids_0 + eos) * [0]
+         return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
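A small round-trip sketch of the tokenizer's documented behaviour: `add_bos_token=True` prepends `<s>` during encoding, and the `decode` override defaults `spaces_between_special_tokens` to False. The vocab file path is whatever you downloaded from this repo:

```python
# Hedged sketch; "tokenizer.model" is the file shipped in this repo.
tok = PanguProMoETokenizer(vocab_file="tokenizer.model")
ids = tok("hello world")["input_ids"]  # begins with tok.bos_token_id because add_bos_token=True
text = tok.decode(ids, skip_special_tokens=True)
assert text.strip() == "hello world"
```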
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b16f1558c0cd4ae6ef1a2c605713be0a514f50e1ce2d2c878979ce988c148ec
+ size 2477809
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"add_bos_token": true, "add_eos_token": false, "add_prefix_space": true, "added_tokens_decoder": {"0": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "1": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "2": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45806": {"content": "<|User|>:", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45813": {"content": "<|Bot|>:", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45830": {"content": "[unused0]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45840": {"content": "[unused1]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45846": {"content": "[unused2]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45849": {"content": "[unused3]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45861": {"content": "[unused4]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45866": {"content": "[unused5]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45874": {"content": "[unused6]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45883": {"content": "[unused7]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45884": {"content": "[unused8]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45887": {"content": "[unused9]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45892": {"content": "[unused10]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45920": {"content": "[unused11]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45932": {"content": "[unused12]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45938": {"content": "[unused13]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45953": {"content": "[unused14]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45968": {"content": "[unused15]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45974": {"content": "[unused16]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45982": {"content": "[unused17]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45986": {"content": "[unused18]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46005": {"content": "[unused19]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46007": {"content": "[unused20]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46014": {"content": "[unused21]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46017": {"content": "[unused22]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46028": {"content": "[unused23]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46032": {"content": "[unused24]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46081": {"content": "[unused25]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46086": {"content": "[unused26]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46101": {"content": "[unused27]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46183": {"content": "[unused28]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46230": {"content": "[unused29]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46245": {"content": "[unused30]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46257": {"content": "[unused31]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "144208": {"content": "[unused32]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "144209": {"content": "[unused33]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}}, "auto_map": {"AutoTokenizer": ["tokenization_pangu_moe.PanguProMoETokenizer", null]}, "bos_token": "<s>", "clean_up_tokenization_spaces": false, "eos_token": "[unused10]", "legacy": true, "model_max_length": 1000000000000000019884624838656, "pad_token": null, "sp_model_kwargs": {}, "spaces_between_special_tokens": false, "tokenizer_class": "PanguProMoETokenizer", "unk_token": "<unk>", "use_default_system_prompt": false, "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '[unused9]系统:[unused10]' }}{% endif %}{% if message['role'] == 'system' %}{{ '[unused9]系统:' + message['content'] + '[unused10]' }}{% endif %}{% if message['role'] == 'assistant' %}{{'[unused9]助手:' + message['content'] + '[unused10]'}}{% endif %}{% if message['role'] == 'tool' %}{{'[unused9]工具:' + message['content'] + '[unused10]'}}{% endif %}{% if message['role'] == 'function' %}{{'[unused9]方法:' + message['content'] + '[unused10]'}}{% endif %}{% if message['role'] == 'user' %}{{'[unused9]用户:' + message['content'] + '[unused10]'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '[unused9]助手:' }}{% endif %}"}