intervitens committed
Commit a7c11fe · verified · 1 Parent(s): 2fb5072

Upload folder using huggingface_hub

Files changed (46)
  1. LICENSE +77 -0
  2. Notice.txt +160 -0
  3. README.md +329 -0
  4. README_CN.md +456 -0
  5. config.json +203 -0
  6. configuration_hunyuan.py +319 -0
  7. generation_config.json +13 -0
  8. hunyuan.py +851 -0
  9. hy.tiktoken +0 -0
  10. model-00001-of-00033.safetensors +3 -0
  11. model-00002-of-00033.safetensors +3 -0
  12. model-00003-of-00033.safetensors +3 -0
  13. model-00004-of-00033.safetensors +3 -0
  14. model-00005-of-00033.safetensors +3 -0
  15. model-00006-of-00033.safetensors +3 -0
  16. model-00007-of-00033.safetensors +3 -0
  17. model-00008-of-00033.safetensors +3 -0
  18. model-00009-of-00033.safetensors +3 -0
  19. model-00010-of-00033.safetensors +3 -0
  20. model-00011-of-00033.safetensors +3 -0
  21. model-00012-of-00033.safetensors +3 -0
  22. model-00013-of-00033.safetensors +3 -0
  23. model-00014-of-00033.safetensors +3 -0
  24. model-00015-of-00033.safetensors +3 -0
  25. model-00016-of-00033.safetensors +3 -0
  26. model-00017-of-00033.safetensors +3 -0
  27. model-00018-of-00033.safetensors +3 -0
  28. model-00019-of-00033.safetensors +3 -0
  29. model-00020-of-00033.safetensors +3 -0
  30. model-00021-of-00033.safetensors +3 -0
  31. model-00022-of-00033.safetensors +3 -0
  32. model-00023-of-00033.safetensors +3 -0
  33. model-00024-of-00033.safetensors +3 -0
  34. model-00025-of-00033.safetensors +3 -0
  35. model-00026-of-00033.safetensors +3 -0
  36. model-00027-of-00033.safetensors +3 -0
  37. model-00028-of-00033.safetensors +3 -0
  38. model-00029-of-00033.safetensors +3 -0
  39. model-00030-of-00033.safetensors +3 -0
  40. model-00031-of-00033.safetensors +3 -0
  41. model-00032-of-00033.safetensors +3 -0
  42. model-00033-of-00033.safetensors +3 -0
  43. model.safetensors.index.json +0 -0
  44. modeling_hunyuan.py +1728 -0
  45. tokenization_hy.py +298 -0
  46. tokenizer_config.json +18 -0
LICENSE ADDED
@@ -0,0 +1,77 @@
1
+ TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT
2
+ Tencent Hunyuan A13B Release Date: June 27, 2025
3
+ THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW.
4
+ By clicking to agree or by using, reproducing, modifying, distributing, performing or displaying any portion or element of the Tencent Hunyuan Works, including via any Hosted Service, You will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
5
+ 1. DEFINITIONS.
6
+ a. “Acceptable Use Policy” shall mean the policy made available by Tencent as set forth in the Exhibit A.
7
+ b. “Agreement” shall mean the terms and conditions for use, reproduction, distribution, modification, performance and displaying of Tencent Hunyuan Works or any portion or element thereof set forth herein.
8
+ c. “Documentation” shall mean the specifications, manuals and documentation for Tencent Hunyuan made publicly available by Tencent.
9
+ d. “Hosted Service” shall mean a hosted service offered via an application programming interface (API), web access, or any other electronic or remote means.
10
+ e. “Licensee,” “You” or “Your” shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Tencent Hunyuan Works for any purpose and in any field of use.
11
+ f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
12
+ g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; (ii) works based on Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan or any Model Derivative of Tencent Hunyuan, to that model in order to cause that model to perform similarly to Tencent Hunyuan or a Model Derivative of Tencent Hunyuan, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan or a Model Derivative of Tencent Hunyuan for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
13
+ h. “Output” shall mean the information and/or content output of Tencent Hunyuan or a Model Derivative that results from operating or otherwise using Tencent Hunyuan or a Model Derivative, including via a Hosted Service.
14
+ i. “Tencent,” “We” or “Us” shall mean the applicable entity or entities in the Tencent corporate family that own(s) intellectual property or other rights embodied in or utilized by the Materials.
15
+ j. “Tencent Hunyuan” shall mean the large language models, text/image/video/audio/3D generation models, and multimodal large language models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us, including, without limitation to, Tencent Hunyuan A13B released at [https://github.com/Tencent-Hunyuan/Hunyuan-A13B].
16
+ k. “Tencent Hunyuan Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
17
+ l. “Territory” shall mean the worldwide territory, excluding the territory of the European Union, United Kingdom and South Korea.
18
+ m. “Third Party” or “Third Parties” shall mean individuals or legal entities that are not under common control with Us or You.
19
+ n. “including” shall mean including but not limited to.
20
+ 2. GRANT OF RIGHTS.
21
+ We grant You, for the Territory only, a non-exclusive, non-transferable and royalty-free limited license under Tencent’s intellectual property or other rights owned by Us embodied in or utilized by the Materials to use, reproduce, distribute, create derivative works of (including Model Derivatives), and make modifications to the Materials, only in accordance with the terms of this Agreement and the Acceptable Use Policy, and You must not violate (or encourage or permit anyone else to violate) any term of this Agreement or the Acceptable Use Policy.
22
+ 3. DISTRIBUTION.
23
+ You may, subject to Your compliance with this Agreement, distribute or make available to Third Parties the Tencent Hunyuan Works, exclusively in the Territory, provided that You meet all of the following conditions:
24
+ a. You must provide all such Third Party recipients of the Tencent Hunyuan Works or products or services using them a copy of this Agreement;
25
+ b. You must cause any modified files to carry prominent notices stating that You changed the files;
26
+ c. You are encouraged to: (i) publish at least one technology introduction blogpost or one public statement expressing Your experience of using the Tencent Hunyuan Works; and (ii) mark the products or services developed by using the Tencent Hunyuan Works to indicate that the product/service is “Powered by Tencent Hunyuan”; and
27
+ d. All distributions to Third Parties (other than through a Hosted Service) must be accompanied by a “Notice” text file that contains the following notice: “Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright © 2025 Tencent. All Rights Reserved. The trademark rights of “Tencent Hunyuan” are owned by Tencent or its affiliate.”
28
+ You may add Your own copyright statement to Your modifications and, except as set forth in this Section and in Section 5, may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Model Derivatives as a whole, provided Your use, reproduction, modification, distribution, performance and display of the work otherwise complies with the terms and conditions of this Agreement (including as regards the Territory). If You receive Tencent Hunyuan Works from a Licensee as part of an integrated end user product, then this Section 3 of this Agreement will not apply to You.
29
+ 4. ADDITIONAL COMMERCIAL TERMS.
30
+ If, on the Tencent Hunyuan version release date, the monthly active users of all products or services made available by or for Licensee is greater than 100 million monthly active users in the preceding calendar month, You must request a license from Tencent, which Tencent may grant to You in its sole discretion, and You are not authorized to exercise any of the rights under this Agreement unless or until Tencent otherwise expressly grants You such rights.
31
+ 5. RULES OF USE.
32
+ a. Your use of the Tencent Hunyuan Works must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Tencent Hunyuan Works, which is hereby incorporated by reference into this Agreement. You must include the use restrictions referenced in these Sections 5(a) and 5(b) as an enforceable provision in any agreement (e.g., license agreement, terms of use, etc.) governing the use and/or distribution of Tencent Hunyuan Works and You must provide notice to subsequent users to whom You distribute that Tencent Hunyuan Works are subject to the use restrictions in these Sections 5(a) and 5(b).
33
+ b. You must not use the Tencent Hunyuan Works or any Output or results of the Tencent Hunyuan Works to improve any other AI model (other than Tencent Hunyuan or Model Derivatives thereof).
34
+ c. You must not use, reproduce, modify, distribute, or display the Tencent Hunyuan Works, Output or results of the Tencent Hunyuan Works outside the Territory. Any such use outside the Territory is unlicensed and unauthorized under this Agreement.
35
+ 6. INTELLECTUAL PROPERTY.
36
+ a. Subject to Tencent’s ownership of Tencent Hunyuan Works made by or for Tencent and intellectual property rights therein, conditioned upon Your compliance with the terms and conditions of this Agreement, as between You and Tencent, You will be the owner of any derivative works and modifications of the Materials and any Model Derivatives that are made by or for You.
37
+ b. No trademark licenses are granted under this Agreement, and in connection with the Tencent Hunyuan Works, Licensee may not use any name or mark owned by or associated with Tencent or any of its affiliates, except as required for reasonable and customary use in describing and distributing the Tencent Hunyuan Works. Tencent hereby grants You a license to use “Tencent Hunyuan” (the “Mark”) in the Territory solely as required to comply with the provisions of Section 3(c), provided that You comply with any applicable laws related to trademark protection. All goodwill arising out of Your use of the Mark will inure to the benefit of Tencent.
38
+ c. If You commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any person or entity alleging that the Materials or any Output, or any portion of any of the foregoing, infringe any intellectual property or other right owned or licensable by You, then all licenses granted to You under this Agreement shall terminate as of the date such lawsuit or other proceeding is filed. You will defend, indemnify and hold harmless Us from and against any claim by any Third Party arising out of or related to Your or the Third Party’s use or distribution of the Tencent Hunyuan Works.
39
+ d. Tencent claims no rights in Outputs You generate. You and Your users are solely responsible for Outputs and their subsequent uses.
40
+ 7. DISCLAIMERS OF WARRANTY AND LIMITATIONS OF LIABILITY.
41
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Tencent Hunyuan Works or to grant any license thereto.
42
+ b. UNLESS AND ONLY TO THE EXTENT REQUIRED BY APPLICABLE LAW, THE TENCENT HUNYUAN WORKS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED “AS IS” WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES OF ANY KIND INCLUDING ANY WARRANTIES OF TITLE, MERCHANTABILITY, NONINFRINGEMENT, COURSE OF DEALING, USAGE OF TRADE, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING, REPRODUCING, MODIFYING, PERFORMING, DISPLAYING OR DISTRIBUTING ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS AND ASSUME ANY AND ALL RISKS ASSOCIATED WITH YOUR OR A THIRD PARTY’S USE OR DISTRIBUTION OF ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS AND YOUR EXERCISE OF RIGHTS AND PERMISSIONS UNDER THIS AGREEMENT.
43
+ c. TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL TENCENT OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, FOR ANY DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, CONSEQUENTIAL OR PUNITIVE DAMAGES, OR LOST PROFITS OF ANY KIND ARISING FROM THIS AGREEMENT OR RELATED TO ANY OF THE TENCENT HUNYUAN WORKS OR OUTPUTS, EVEN IF TENCENT OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
44
+ 8. SURVIVAL AND TERMINATION.
45
+ a. The term of this Agreement shall commence upon Your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
46
+ b. We may terminate this Agreement if You breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, You must promptly delete and cease use of the Tencent Hunyuan Works. Sections 6(a), 6(c), 7 and 9 shall survive the termination of this Agreement.
47
+ 9. GOVERNING LAW AND JURISDICTION.
48
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of the Hong Kong Special Administrative Region of the People’s Republic of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
49
+ b. Exclusive jurisdiction and venue for any dispute arising out of or relating to this Agreement will be a court of competent jurisdiction in the Hong Kong Special Administrative Region of the People’s Republic of China, and Tencent and Licensee consent to the exclusive jurisdiction of such court with respect to any such dispute.
50
+
51
+ EXHIBIT A
52
+ ACCEPTABLE USE POLICY
53
+
54
+ Tencent reserves the right to update this Acceptable Use Policy from time to time.
55
+ Last modified: November 5, 2024
56
+
57
+ Tencent endeavors to promote safe and fair use of its tools and features, including Tencent Hunyuan. You agree not to use Tencent Hunyuan or Model Derivatives:
58
+ 1. Outside the Territory;
59
+ 2. In any way that violates any applicable national, federal, state, local, international or any other law or regulation;
60
+ 3. To harm Yourself or others;
61
+ 4. To repurpose or distribute output from Tencent Hunyuan or any Model Derivatives to harm Yourself or others;
62
+ 5. To override or circumvent the safety guardrails and safeguards We have put in place;
63
+ 6. For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
64
+ 7. To generate or disseminate verifiably false information and/or content with the purpose of harming others or influencing elections;
65
+ 8. To generate or facilitate false online engagement, including fake reviews and other means of fake online engagement;
66
+ 9. To intentionally defame, disparage or otherwise harass others;
67
+ 10. To generate and/or disseminate malware (including ransomware) or any other content to be used for the purpose of harming electronic systems;
68
+ 11. To generate or disseminate personal identifiable information with the purpose of harming others;
69
+ 12. To generate or disseminate information (including images, code, posts, articles), and place the information in any public context (including –through the use of bot generated tweets), without expressly and conspicuously identifying that the information and/or content is machine generated;
70
+ 13. To impersonate another individual without consent, authorization, or legal right;
71
+ 14. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);
72
+ 15. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
73
+ 16. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
74
+ 17. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;
75
+ 18. To intentionally exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
76
+ 19. For military purposes;
77
+ 20. To engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or other professional practices.
Notice.txt ADDED
@@ -0,0 +1,160 @@
1
+ Usage and Legal Notices:
2
+
3
+ Tencent is pleased to support the open source community by making Tencent Hunyuan A13B available.
4
+
5
+ Copyright (C) Tencent. All rights reserved. The below software and/or models in this distribution may have been modified by Tencent ("Tencent Modifications"). All Tencent Modifications are Copyright (C) Tencent.
6
+
7
+ Tencent Hunyuan A13B is licensed under TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT, which can be found in this repository called "LICENSE", except for the third-party components listed below. Tencent Hunyuan A13B does not impose any additional limitations beyond what is outlined in the respective licenses of these third-party components. Users must comply with all terms and conditions of original licenses of these third-party components and must ensure that the usage of the third party components adheres to all relevant laws and regulations.
8
+
9
+ For avoidance of doubts, Tencent Hunyuan A13B refers to the inference code, training code, parameters and the weights of Tencent Hunyuan A13B only, which are made publicly available by Tencent in accordance with the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT.
10
+
11
+
12
+ Other dependencies and licenses:
13
+
14
+
15
+ Open Source Software Licensed under the Apache License Version 2.0:
16
+ The below software in this distribution may have been modified by Tencent ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2025 Tencent.
17
+ --------------------------------------------------------------------
18
+ 1. pytorch
19
+ Copyright 2016-2017 TorchAPI
20
+ Copyright 2016-2017 Contributors
21
+
22
+ 2. VLLM
23
+ Copyright (c) vllm original author and authors
24
+ Please note this software has been modified by Tencent in this distribution.
25
+
26
+ 3. transformers
27
+ Copyright 2018- The Hugging Face team. All rights reserved.
28
+
29
+ 4. accelerate
30
+ Copyright (c) accelerate original author and authors
31
+
32
+
33
+ Terms of the Apache License Version 2.0:
34
+ --------------------------------------------------------------------
35
+ Apache License
36
+
37
+ Version 2.0, January 2004
38
+
39
+ http://www.apache.org/licenses/
40
+
41
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
42
+ 1. Definitions.
43
+
44
+ "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
45
+
46
+ "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
47
+
48
+ "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
49
+
50
+ "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
51
+
52
+ "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
53
+
54
+ "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
55
+
56
+ "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
57
+
58
+ "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
59
+
60
+ "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
63
+
64
+ 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
65
+
66
+ 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
67
+
68
+ 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
69
+
70
+ You must give any other recipients of the Work or Derivative Works a copy of this License; and
71
+
72
+ You must cause any modified files to carry prominent notices stating that You changed the files; and
73
+
74
+ You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
75
+
76
+ If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
77
+
78
+ You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
79
+
80
+ 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
81
+
82
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
83
+
84
+ 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
85
+
86
+ 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
87
+
88
+ 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
89
+
90
+ END OF TERMS AND CONDITIONS
91
+
92
+
93
+
94
+ Open Source Software Licensed under the BSD 3-Clause License and Other Licenses of the Third-Party Components therein:
95
+ --------------------------------------------------------------------
96
+ 1. pytorch
97
+ Copyright (c) 2016- Facebook, Inc (Adam Paszke)
98
+ Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
99
+ Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
100
+ Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
101
+ Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
102
+ Copyright (c) 2011-2013 NYU (Clement Farabet)
103
+ Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
104
+ Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
105
+ Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
106
+
107
+
108
+ Terms of the BSD 3-Clause:
109
+ --------------------------------------------------------------------
110
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
111
+
112
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
113
+
114
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
115
+
116
+ 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
117
+
118
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
119
+
120
+ For the license of other third party components, please refer to the following URL:
121
+ https://github.com/pytorch/pytorch/blob/v2.1.1/NOTICE
122
+ https://github.com/pytorch/pytorch/tree/v2.1.1/third_party
123
+
124
+
125
+ Open Source Software Licensed under the BSD 3-Clause License:
126
+ --------------------------------------------------------------------
127
+ 1. flash_attn
128
+ Copyright (c) 2022, the respective contributors, as shown by the AUTHORS file.
129
+ All rights reserved.
130
+
131
+
132
+ A copy of the BSD 3-Clause is included in this file.
133
+
134
+
135
+
136
+ Open Source Software Licensed under the Apache License Version 2.0 and Other Licenses of the Third-Party Components therein:
137
+ The below software in this distribution is modified by Tencent ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2025 Tencent.
138
+ --------------------------------------------------------------------
139
+ 1. sglang
140
+ Copyright 2023-2024 SGLang Team
141
+
142
+
143
+ A copy of the Apache 2.0 is included in this file.
144
+
145
+ For the license of other third party components, please refer to the following URL:
146
+ https://github.com/sgl-project/sglang/tree/v0.4.7/3rdparty/amd
147
+
148
+
149
+
150
+ Open Source Software Licensed under the Apache License Version 2.0 and Other Licenses of the Third-Party Components therein:
151
+ The below software in this distribution is modified by Tencent ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2025 Tencent.
152
+ --------------------------------------------------------------------
153
+ 1. TensorRT-LLM
154
+ Copyright (c) TensorRT-LLM original author and authors
155
+
156
+
157
+ A copy of the Apache 2.0 is included in this file.
158
+
159
+ For the license of other third party components, please refer to the following URL:
160
+ https://github.com/NVIDIA/TensorRT-LLM/tree/v0.20.0/3rdparty
README.md ADDED
@@ -0,0 +1,329 @@
1
+ ---
2
+ license: other
3
+ license_name: tencent-hunyuan-a13b
4
+ license_link: https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/LICENSE
5
+ library_name: transformers
6
+ ---
7
+
8
+ <p align="center">
9
+ <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
10
+ </p><p></p>
11
+
12
+
13
+ <p align="center">
14
+ 🤗&nbsp;<a href="https://huggingface.co/tencent/Hunyuan-A13B-Instruct"><b>Hugging Face</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
15
+ 🖥️&nbsp;<a href="https://hunyuan.tencent.com" style="color: red;"><b>Official Website</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
16
+ 🕖&nbsp;<a href="https://cloud.tencent.com/product/hunyuan"><b>HunyuanAPI</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
17
+ 🕹️&nbsp;<a href="https://hunyuan.tencent.com/?model=hunyuan-a13b"><b>Demo</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
18
+ 🤖&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct"><b>ModelScope</b></a>
19
+ </p>
20
+
21
+
22
+ <p align="center">
23
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf"><b>Technical Report</b> </a> |
24
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-A13B"><b>GITHUB</b></a> |
25
+ <a href="https://cnb.cool/tencent/hunyuan/Hunyuan-A13B"><b>cnb.cool</b></a> |
26
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/LICENSE"><b>LICENSE</b></a>
27
+ </p>
28
+
29
+
30
+
31
+ Welcome to the official repository of **Hunyuan-A13B**, an innovative and open-source large language model (LLM) built on a fine-grained Mixture-of-Experts (MoE) architecture. Designed for efficiency and scalability, Hunyuan-A13B delivers cutting-edge performance with minimal computational overhead, making it an ideal choice for advanced reasoning and general-purpose applications, especially in resource-constrained environments.
32
+
33
+ ## Model Introduction
34
+
35
+ With the rapid advancement of artificial intelligence technology, large language models (LLMs) have achieved remarkable progress in natural language processing, computer vision, and scientific tasks. However, as model scales continue to expand, optimizing resource consumption while maintaining high performance has become a critical challenge. To address this, we have explored Mixture of Experts (MoE) architectures. The newly introduced Hunyuan-A13B model features a total of 80 billion parameters with 13 billion active parameters. It not only delivers high-performance results but also achieves optimal resource efficiency, successfully balancing computational power and resource utilization.
36
+
37
+ ### Key Features and Advantages
38
+
39
+ - **Compact yet Powerful**: With only 13 billion active parameters (out of a total of 80 billion), the model delivers competitive performance on a wide range of benchmark tasks, rivaling much larger models.
40
+ - **Hybrid Reasoning Support**: Supports both fast and slow thinking modes, allowing users to flexibly choose according to their needs.
41
+ - **Ultra-Long Context Understanding**: Natively supports a 256K context window, maintaining stable performance on long-text tasks.
42
+ - **Enhanced Agent Capabilities**: Optimized for agent tasks, achieving leading results on benchmarks such as BFCL-v3, τ-Bench and C3-Bench.
43
+ - **Efficient Inference**: Utilizes Grouped Query Attention (GQA) and supports multiple quantization formats, enabling highly efficient inference.
44
+
45
+ ### Why Choose Hunyuan-A13B?
46
+
47
+ As a powerful yet computationally efficient large model, Hunyuan-A13B is an ideal choice for researchers and developers seeking high performance under resource constraints. Whether for academic research, cost-effective AI solution development, or innovative application exploration, this model provides a robust foundation for advancement.
48
+
49
+ &nbsp;
50
+
51
+ ## Related News
52
+ * 2025.6.27 We have open-sourced **Hunyuan-A13B-Pretrain**, **Hunyuan-A13B-Instruct**, **Hunyuan-A13B-Instruct-FP8**, and **Hunyuan-A13B-Instruct-GPTQ-Int4** on Hugging Face. In addition, we have released a <a href="report/Hunyuan_A13B_Technical_Report.pdf">technical report</a> and a training and inference operation manual, which provide detailed information about the model’s capabilities as well as the operations for training and inference.
53
+
54
+ <br>
55
+
56
+
57
+ ## Benchmark
58
+
59
+ Note: The following benchmarks are evaluated with the TRT-LLM backend on several **base models**.
60
+
61
+ | Model | Hunyuan-Large | Qwen2.5-72B | Qwen3-A22B | Hunyuan-A13B |
62
+ |------------------|---------------|--------------|-------------|---------------|
63
+ | MMLU | 88.40 | 86.10 | 87.81 | 88.17 |
64
+ | MMLU-Pro | 60.20 | 58.10 | 68.18 | 67.23 |
65
+ | MMLU-Redux | 87.47 | 83.90 | 87.40 | 87.67 |
66
+ | BBH | 86.30 | 85.80 | 88.87 | 87.56 |
67
+ | SuperGPQA | 38.90 | 36.20 | 44.06 | 41.32 |
68
+ | EvalPlus | 75.69 | 65.93 | 77.60 | 78.64 |
69
+ | MultiPL-E | 59.13 | 60.50 | 65.94 | 69.33 |
70
+ | MBPP | 72.60 | 76.00 | 81.40 | 83.86 |
71
+ | CRUX-I | 57.00 | 57.63 | - | 70.13 |
72
+ | CRUX-O | 60.63 | 66.20 | 79.00 | 77.00 |
73
+ | MATH | 69.80 | 62.12 | 71.84 | 72.35 |
74
+ | CMATH | 91.30 | 84.80 | - | 91.17 |
75
+ | GSM8k | 92.80 | 91.50 | 94.39 | 91.83 |
76
+ | GPQA | 25.18 | 45.90 | 47.47 | 49.12 |
77
+
78
+
79
+ Hunyuan-A13B-Instruct has achieved highly competitive performance across multiple benchmarks, particularly in mathematics, science, agent domains, and more. We compared it with several powerful models, and the results are shown below.
80
+
81
+ | Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
82
+ |:-------------------:|:----------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
83
+ | **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 |
84
+ | **Science** | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 |
85
+ | **Coding** | Livecodebench<br>Fullstackbench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 |
86
+ | **Reasoning** | BBH<br>DROP<br>ZebraLogic | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 |
87
+ | **Instruction<br>Following** | IF-Eval<br>SysBench | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 |
88
+ | **Text<br>Creation**| LengthCtrl<br>InsCtrl | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 |
89
+ | **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>76.3 | 59.8<br>56.4 | 61.2<br>62.9 |
90
+ | **Agent** | BFCL v3<br> τ-Bench<br>ComplexFuncBench<br> C3-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 56.9<br>43.8<br>41.1<br>55.3 | 70.8<br>44.6<br>40.6<br>51.7 | 78.3<br>54.7<br>61.2<br>63.5 |
91
+
92
+
93
+ &nbsp;
94
+
95
+ ## Use with transformers
96
+
97
+ Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning.
98
+ 1. Pass "enable_thinking=False" when calling apply_chat_template.
99
+ 2. Adding "/no_think" before the prompt will force the model not to perform CoT reasoning. Similarly, adding "/think" before the prompt will force the model to perform CoT reasoning.
100
+
101
+ The following code snippet shows how to use the transformers library to load and apply the model. It also demonstrates how to enable and disable the reasoning mode, and how to parse the reasoning process along with the final output.
104
+
105
+
106
+
107
+ ```python
108
+ from transformers import AutoModelForCausalLM, AutoTokenizer
109
+ import os
110
+ import re
111
+
112
+ model_name_or_path = os.environ['MODEL_PATH']
113
+ # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
114
+
115
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
116
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
117
+ messages = [
118
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
119
+ ]
120
+ tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
121
+ enable_thinking=True # Toggle thinking mode (default: True)
122
+ )
123
+
124
+ outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
125
+
126
+ output_text = tokenizer.decode(outputs[0])
127
+
128
+ think_pattern = r'<think>(.*?)</think>'
129
+ think_matches = re.findall(think_pattern, output_text, re.DOTALL)
130
+
131
+ answer_pattern = r'<answer>(.*?)</answer>'
132
+ answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
133
+
134
+ think_content = [match.strip() for match in think_matches][0]
135
+ answer_content = [match.strip() for match in answer_matches][0]
136
+ print(f"thinking_content:{think_content}\n\n")
137
+ print(f"answer_content:{answer_content}\n\n")
138
+ ```
139
+
140
+ ### Fast and slow thinking switch
141
+
142
+ This model supports two modes of operation:
143
+
144
+ - Slow Thinking Mode (Default): Enables detailed internal reasoning steps before producing the final answer.
145
+ - Fast Thinking Mode: Skips the internal reasoning process for faster inference, going straight to the final answer.
146
+
147
+ **Switching to Fast Thinking Mode:**
148
+
149
+ To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
150
+ ```python
151
+ tokenized_chat = tokenizer.apply_chat_template(
152
+ messages,
153
+ tokenize=True,
154
+ add_generation_prompt=True,
155
+ return_tensors="pt",
156
+ enable_thinking=False # Use fast thinking mode
157
+ )
158
+ ```
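+
+ Alternatively, as noted above, the reasoning mode can also be controlled from the prompt itself with the `/think` and `/no_think` prefixes. The snippet below is a minimal sketch that reuses the `tokenizer` and `model` objects from the earlier example; the message content and generation settings are illustrative assumptions, not part of the official usage guide.
+
+ ```python
+ # Minimal sketch: prompt-prefix control of CoT reasoning (reuses tokenizer/model from above).
+ # "/no_think" asks the model to skip its internal reasoning; "/think" forces it.
+ messages = [
+     {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
+ ]
+ tokenized_chat = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt",
+ )
+ outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=1024)
+ print(tokenizer.decode(outputs[0]))
+ ```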
159
+
160
+
161
+
162
+ ## Deployment
163
+
164
+ For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.
165
+
166
+ Docker images: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags
167
+
168
+
169
+ ### TensorRT-LLM
170
+
171
+ #### Docker Image
172
+
173
+ We provide a pre-built Docker image based on the latest version of TensorRT-LLM.
174
+
175
+ - To get started:
176
+
177
+ https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
178
+
179
+ ```
180
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
181
+ ```
182
+ ```
183
+ docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
184
+ ```
185
+
186
+ - Prepare Configuration file:
187
+
188
+ ```
189
+ cat >/path/to/extra-llm-api-config.yml <<EOF
190
+ use_cuda_graph: true
191
+ cuda_graph_padding_enabled: true
192
+ cuda_graph_batch_sizes:
193
+ - 1
194
+ - 2
195
+ - 4
196
+ - 8
197
+ - 16
198
+ - 32
199
+ print_iter_log: true
200
+ EOF
201
+ ```
202
+
203
+
204
+ - Start the API server:
205
+
206
+
207
+ ```
208
+ trtllm-serve \
209
+ /path/to/HunYuan-moe-A13B \
210
+ --host localhost \
211
+ --port 8000 \
212
+ --backend pytorch \
213
+ --max_batch_size 32 \
214
+ --max_num_tokens 16384 \
215
+ --tp_size 2 \
216
+ --kv_cache_free_gpu_memory_fraction 0.6 \
217
+ --trust_remote_code \
218
+ --extra_llm_api_options /path/to/extra-llm-api-config.yml
219
+ ```
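+
+ Once the server is running, it exposes an OpenAI-compatible API. Below is a minimal client-side sketch, mirroring the OpenAI chat client example in README_CN.md; the `model` name ("default") and `api_key` value ("tensorrt_llm") are assumptions and may need to be adjusted to match your `trtllm-serve` configuration.
+
+ ```python
+ # Minimal OpenAI-compatible client sketch for the trtllm-serve endpoint started above.
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="http://localhost:8000/v1",
+     api_key="tensorrt_llm",  # placeholder key; adjust if your deployment enforces authentication
+ )
+
+ response = client.chat.completions.create(
+     model="default",  # assumed served-model name; check the server's /v1/models endpoint
+     messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
+     max_tokens=4096,
+ )
+ print(response.choices[0].message.content)
+ ```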
220
+
221
+
222
+ ### vLLM
223
+
224
+ #### Docker Image
225
+ We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. Support in the official vLLM release is currently under development. **Note: CUDA 12.8 is required for this Docker image.**
226
+
227
+ - To get started:
228
+
229
+ ```
230
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm
231
+ or
232
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
233
+ ```
234
+
235
+ - Download the model files:
236
+ - Hugging Face: downloaded automatically by vLLM.
237
+ - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
238
+
239
+
240
+ - Start the API server:
241
+
242
+ Model downloaded from Hugging Face:
243
+ ```
244
+ docker run --rm --ipc=host \
245
+ -v ~/.cache:/root/.cache/ \
246
+ --security-opt seccomp=unconfined \
247
+ --net=host \
248
+ --gpus=all \
249
+ -it \
250
+ -e VLLM_USE_V1=0 \
251
+ --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
252
+ -m vllm.entrypoints.openai.api_server \
253
+ --host 0.0.0.0 \
254
+ --tensor-parallel-size 4 \
255
+ --port 8000 \
256
+ --model tencent/Hunyuan-A13B-Instruct \
257
+ --trust_remote_code
258
+ ```
259
+
260
+ Model downloaded via ModelScope:
261
+ ```
262
+ docker run --rm --ipc=host \
263
+ -v ~/.cache/modelscope:/root/.cache/modelscope \
264
+ --security-opt seccomp=unconfined \
265
+ --net=host \
266
+ --gpus=all \
267
+ -it \
268
+ -e VLLM_USE_V1=0 \
269
+ --entrypoint python mirror.ccs.tencentyun.com/hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm \
270
+ -m vllm.entrypoints.openai.api_server \
271
+ --host 0.0.0.0 \
272
+ --tensor-parallel-size 4 \
273
+ --port 8000 \
274
+ --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
275
+ --trust_remote_code
276
+ ```
277
+
278
+
279
+ #### Tool Calling with vLLM
280
+
281
+ To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
282
+
283
+ For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
284
+ 🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
285
+
286
+ When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
287
+
288
+ | Parameter | Value |
289
+ |--------------------------|-----------------------------------------------------------------------|
290
+ | `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
291
+ | `--tool-call-parser` | `hunyuan` |
292
+
293
+ These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
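+
+ For illustration, the sketch below shows a client-side request that exercises the tool parser. It assumes the vLLM server above was started with `--enable-auto-tool-choice`, `--tool-parser-plugin`, and `--tool-call-parser hunyuan`, and that the served model name matches the `--model` argument; the `get_weather` tool definition is a hypothetical example, not part of the model release.
+
+ ```python
+ # Hypothetical tool-calling request against the vLLM OpenAI-compatible endpoint.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ tools = [{
+     "type": "function",
+     "function": {
+         "name": "get_weather",  # hypothetical tool for illustration
+         "description": "Get the current weather for a city",
+         "parameters": {
+             "type": "object",
+             "properties": {"city": {"type": "string"}},
+             "required": ["city"],
+         },
+     },
+ }]
+
+ response = client.chat.completions.create(
+     model="tencent/Hunyuan-A13B-Instruct",
+     messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
+     tools=tools,
+ )
+ # With the hunyuan tool parser enabled, tool invocations come back as structured tool_calls.
+ print(response.choices[0].message.tool_calls)
+ ```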
294
+
295
+ ### Reasoning parser
296
+
297
+ Reasoning-parser support for the Hunyuan-A13B model in vLLM is under development.
298
+
299
+ ### SGLang
300
+
301
+ #### Docker Image
302
+
303
+ We also provide a pre-built Docker image based on the latest version of SGLang.
304
+
305
+ To get started:
306
+
307
+ - Pull the Docker image
308
+
309
+ ```
310
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
311
+ or
312
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
313
+ ```
314
+
315
+ - Start the API server:
316
+
317
+ ```
318
+ docker run --gpus all \
319
+ --shm-size 32g \
320
+ -p 30000:30000 \
321
+ --ipc=host \
322
+ docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
323
+ -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
324
+ ```
325
+
326
+
327
+ ## Contact Us
328
+
329
+ If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also contact us via email ([email protected]).
README_CN.md ADDED
@@ -0,0 +1,456 @@
1
+ <p align="center">
2
+ <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
3
+ </p><p></p>
4
+
5
+ <p align="center">
6
+ 🫣&nbsp;<a href="https://huggingface.co/tencent/Hunyuan-A13B-Instruct"><b>Hugging Face</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
7
+ 🖥️&nbsp;<a href="https://llm.hunyuan.tencent.com/" style="color: red;"><b>Official Website</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
8
+ 🕖&nbsp;<a href="https://cloud.tencent.com/product/hunyuan"><b>HunyuanAPI</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
9
+ 🕹️&nbsp;<a href="https://hunyuan.tencent.com/?model=hunyuan-a13b"><b>Demo</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
10
+ <img src="https://avatars.githubusercontent.com/u/109945100?s=200&v=4" width="16"/>&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct"><b>ModelScope</b></a>
11
+ </p>
12
+
13
+ <p align="center">
14
+ <a href="https://github.com/Tencent/Hunyuan-A13B"><b>GITHUB</b></a>
15
+ </p>
16
+
17
+
18
+
19
+
20
+ ## 模型介绍
21
+
22
+ 随着人工智能技术的快速发展,大型语言模型(LLMs)在自然语言处理、计算机视觉和科学任务等领域取得了显著进展。然而,随着模型规模的扩大,如何在保持高性能的同时优化资源消耗成为一个关键挑战。为了应对这一挑战,我们研究了混合专家(MoE)模型,当前亮相的 Hunyuan-A13B 模型,拥有800亿总参数和130亿激活参数。不仅在效果上达到了高标准,而且在尺寸上也做到了极致的优化,成功平衡了模型性能与资源占用。
23
+
24
+
25
+ ### 核心特性与优势
26
+ - ​**小参数量,高性能**​:仅激活130亿参数(总参数量800亿),即可在多样化基准任务中媲美更大规模模型的竞争力表现
27
+ - ​**混合推理支持**​:同时支持快思考和慢思考两种模式,支持用户灵活选择
28
+ - ​**超长上下文理解**​:原生支持256K上下文窗口,在长文本任务中保持稳定性能
29
+ - ​**增强Agent能力**​:优化Agent能力,在BFCL-v3、τ-Bench等智能体基准测试中领先
30
+ - ​**高效推理**​:采用分组查询注意力(GQA)策略,支持多量化格式,实现高效推理
31
+
32
+
33
+ ### 为何选择Hunyuan-A13B?
34
+ 作为兼具强大性能与计算效率的大模型,Hunyuan-A13B是研究者与开发者在资源受限条件下追求高性能的理想选择。无论学术研究、高性价比AI解决方案开发,还是创新应用探索,本模型都能提供强大的基础支持。
35
+
36
+
37
+ &nbsp;
38
+
39
+ ## 新闻
40
+ <br>
41
+
42
+ * 2025.6.26 我们在Hugging Face开源了 **Hunyuan-A13B-Instruct**,**Hunyuan-A13B-Pretrain**, **Hunyuan-A13B-Instruct-FP8**, **Hunyuan-A13B-Instruct-GPTQ-Int4**。并发布了技术报告和训练推理操作手册,详细介绍了模型能力和训练与推理的操作。
43
+
44
+ ## 模型结构
45
+
46
+ Hunyuan-A13B采用了细粒度混合专家(Fine-grained Mixture of Experts,Fine-grained MoE)架构,包含800亿参数和130亿激活参数,累计训练了超过 20T tokens。该模型支持 256K 的上下文长度,以下为模型结构细节:
47
+ * 总参数: 80B
48
+ * 激活参数: 13B
49
+ * 层数: 32
50
+ * Attention Heads: 32
51
+ * 共享专家数: 1
52
+ * 非共享专家数: 64
53
+ * 路由策略: Top-8
54
+ * 激活函数: SwiGLU
55
+ * 隐层维度: 4096
56
+ * 专家隐层维度: 3072
57
+
58
+ ## Benchmark评估榜单
59
+
60
+ **Hunyuan-A13B-Pretrain** 在 12/14 个任务上超越了Hunyuan上一代52B激活参数的MoE模型Hunyuan-Large,证实了它在预训练任务上出色的能力。与业界更大参数量的Dense和MoE模型相比, Hunyuan-A13B在多个代码和数学任务上都取得了最高分数。在MMLU, MMLU-PRO等诸多众聚合任务上, Hunyuan-A13B达到了与Qwen3-A22B模型同等的水平,表现出优秀的综合能力。
61
+
62
+ | Model | Hunyuan-Large | Qwen2.5-72B | Qwen3-A22B | Hunyuan-A13B |
63
+ |------------------|---------------|--------------|-------------|---------------|
64
+ | MMLU | 88.40 | 86.10 | 87.81 | 88.17 |
65
+ | MMLU-Pro | 60.20 | 58.10 | 68.18 | 67.23 |
66
+ | MMLU-Redux | 87.47 | 83.90 | 87.40 | 87.67 |
67
+ | BBH | 86.30 | 85.80 | 88.87 | 87.56 |
68
+ | SuperGPQA | 38.90 | 36.20 | 44.06 | 41.32 |
69
+ | EvalPlus | 75.69 | 65.93 | 77.60 | 78.64 |
70
+ | MultiPL-E | 59.13 | 60.50 | 65.94 | 69.33 |
71
+ | MBPP | 72.60 | 76.00 | 81.40 | 83.86 |
72
+ | CRUX-I | 57.00 | 57.63 | - | 70.13 |
73
+ | CRUX-O | 60.63 | 66.20 | 79.00 | 77.00 |
74
+ | MATH | 69.80 | 62.12 | 71.84 | 72.35 |
75
+ | CMATH | 91.30 | 84.80 | - | 91.17 |
76
+ | GSM8k | 92.80 | 91.50 | 94.39 | 91.83 |
77
+ | GPQA | 25.18 | 45.90 | 47.47 | 49.12 |
78
+
79
+ **Hunyuan-A13B-Instruct** 在多项基准测试中取得了极具有竞争力的表现,尤其是在数学、科学、agent等领域。我们与一些强力模型进行了对比,结果如下所示。
80
+
81
+ | Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
82
+ |:-------------------:|:-----------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
83
+ | **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 |
84
+ | **Science** | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 |
85
+ | **Coding** | Livecodebench<br>Fullstackbench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 |
86
+ | **Reasoning** | BBH<br>DROP<br>ZebraLogic | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 |
87
+ | **Instruction<br>Following** | IF-Eval<br>SysBench | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 |
88
+ | **Text<br>Creation**| LengthCtrl<br>InsCtrl | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 |
89
+ | **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>76.3 | 59.8<br>56.4 | 61.2<br>62.9 |
90
+ | **Agent** | BDCL v3<br> τ-Bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 56.9<br>43.8<br>41.1<br>55.3 | 70.8<br>44.6<br>40.6<br>51.7 | 78.3<br>54.7<br>61.2<br>63.5 |
91
+
92
+
93
+ ## 推理和部署
94
+
95
+ HunyuanLLM可以采用vLLM,sglang或TensorRT-LLM部署。为了简化部署过程HunyuanLLM提供了预构建docker镜像。
96
+
97
+
98
+ ## 使用TensorRT-LLM推理
99
+
100
+ ### BF16部署
101
+
102
+ #### Step1:执行推理
103
+
104
+ #### 方式1:命令行推理
105
+
106
+ 下面我们展示一个代码片段,采用`TensorRT-LLM`快速请求chat model:
107
+ 修改 examples/pytorch/quickstart_advanced.py 中如下代码:
108
+
109
+
110
+ ```python
111
+ from tensorrt_llm import SamplingParams
112
+ from tensorrt_llm._torch import LLM
113
+ from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
114
+ from tensorrt_llm.llmapi import (EagleDecodingConfig, KvCacheConfig,
115
+ MTPDecodingConfig)
116
+
117
+ prompt = "Write a short summary of the benefits of regular exercise"
118
+
119
+ def main():
120
+ args = parse_arguments()
121
+
122
+ llm, sampling_params = setup_llm(args)
123
+ new_prompts = []
124
+ if args.apply_chat_template:
125
+ messages = [{"role": "user", "content": f"{prompt}"}]
126
+ new_prompts.append(llm.tokenizer.apply_chat_template(
127
+ messages, tokenize=False, add_generation_prompt=True)
128
+ )
129
+
130
+ outputs = llm.generate(new_prompts, sampling_params)
131
+
132
+ for i, output in enumerate(outputs):
133
+ prompt = output.prompt
134
+ generated_text = output.outputs[0].text
135
+ print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
136
+ ```
137
+
138
+ Run it with:
139
+
140
+ ```shell
141
+ python3 quickstart_advanced.py --model_dir "<path to the HunyuanLLM model>" --tp_size 4 --apply_chat_template
142
+ ```
143
+
144
+ #### Option 2: Serving
145
+
146
+ Below we show how to deploy the model as a service with `TensorRT-LLM` and send requests.
147
+
148
+ ```shell
149
+ model_path="<path to the HunyuanLLM model>"
150
+ trtllm-serve <model_path> [--backend pytorch --tp_size <tp> --ep_size <ep> --host <host> --port <port>]
151
+ ```
152
+
153
+ After the service starts successfully, run the request script:
154
+ ```python
155
+ ### OpenAI Chat Client
156
+
157
+ from openai import OpenAI
158
+
159
+ client = OpenAI(
160
+ base_url="http://localhost:8000/v1",
161
+ api_key="tensorrt_llm",
162
+ )
163
+
164
+ response = client.chat.completions.create(
165
+ model="default",
166
+ messages=[{
167
+ "role": "user",
168
+ "content": "Write a short summary of the benefits of regular exercise"
169
+ }],
170
+ max_tokens=4096,
171
+ )
172
+ print(response)
173
+ ```
174
+
175
+ #### FP8/Int4 quantized model deployment:
176
+ FP8 and Int4 quantized models for TensorRT-LLM are still being worked on; stay tuned.
177
+
178
+
179
+ ## Inference with vLLM
180
+ ### Docker:
181
+
182
+ To simplify deployment, HunyuanLLM provides a pre-built docker image:
183
+
184
+ [hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm](https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags). Simply download the model files and start the container with the commands below to run inference.
185
+ ```shell
186
+ # Pull the image
187
+ docker pull hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
188
+ # Start the container
189
+ docker run --name hunyuanLLM_infer -itd --privileged --user root --net=host --ipc=host --gpus=8 hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm
190
+ ```
191
+
192
+ Note on Docker container privileges: starting the container in privileged mode (--privileged), as above, grants it elevated permissions and increases the risk of data leakage and cluster compromise. Avoid privileged mode unless it is strictly necessary; where it is required, perform a thorough security assessment and put appropriate monitoring and hardening measures in place.
193
+
194
+
195
+ ### BF16 deployment
196
+
197
+ BF16 can be deployed on 2 GPUs with more than 80 GB of memory each; for long-context workloads, TP4 is recommended. Follow the steps below:
198
+
199
+ Before running the commands, set the following environment variable:
200
+
201
+ ```shell
202
+ export MODEL_PATH=PATH_TO_MODEL
203
+ ```
204
+
205
+ #### Step 1: Run inference
206
+
207
+ #### Option 1: Command-line inference
208
+
209
+ Below is a code snippet that uses `vLLM` to quickly query the chat model:
210
+
211
+ Note on remote code execution in vLLM: if the `trust-remote-code` option is enabled, vLLM will load and execute code from the remote model repository, which could allow malicious code to run. Unless your use case explicitly requires it, keep this option disabled to reduce potential security risks.
212
+
213
+
214
+ ```python
215
+ import os
216
+ from typing import List, Optional
217
+ from vllm import LLM, SamplingParams
218
+ from vllm.inputs import PromptType
219
+ from transformers import AutoTokenizer
220
+
221
+ model_path=os.environ.get('MODEL_PATH')
222
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
223
+
224
+ llm = LLM(model=model_path,
225
+ tokenizer=model_path,
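+ # trust_remote_code=True loads and runs this repository's custom model code; enable it only for sources you trust (see the note above)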
226
+ trust_remote_code=True,
227
+ dtype='bfloat16',
228
+ tensor_parallel_size=4,
229
+ gpu_memory_utilization=0.9)
230
+
231
+ sampling_params = SamplingParams(
232
+ temperature=0.7, top_p=0.8, max_tokens=4096, top_k=20, repetition_penalty=1.05)
233
+
234
+ messages = [
235
+ {
236
+ "role": "system",
237
+ "content": "You are a helpful assistant.",
238
+ },
239
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
240
+ ]
241
+
242
+ tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
243
+
244
+ dummy_inputs: List[PromptType] = [{
245
+ "prompt_token_ids": batch
246
+ } for batch in tokenized_chat.numpy().tolist()]
247
+
248
+ outputs = llm.generate(dummy_inputs, sampling_params)
249
+
250
+ # Print the outputs.
251
+ for output in outputs:
252
+ prompt = output.prompt
253
+ generated_text = output.outputs[0].text
254
+ print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
255
+ ```
256
+
257
+ #### Option 2: Serving
258
+
259
+ Below we show how to deploy the model as a service with `vLLM` and send requests.
260
+
261
+ Run the following on the master node:
262
+
263
+ ```shell
264
+ export VLLM_HOST_IP=${LOCAL_IP}
265
+ ```
266
+ Then start the service by running:
267
+ ```shell
268
+ cd inference
269
+ sh run_server.sh
270
+ ```
271
+
272
+ After `run_server.sh` runs successfully, run the request script:
273
+ ```shell
274
+ sh openapi.sh
275
+ ```
276
+
277
+ Remember to replace `${LOCAL_IP}` and `${MODEL_PATH}` in `openapi.sh` with the values used by your deployment; an equivalent Python client is sketched below.
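+ If you prefer not to use `openapi.sh`, the same request can be sent from Python. This is a minimal, illustrative sketch: it assumes the service started by `run_server.sh` exposes vLLM's OpenAI-compatible API, and the host, port (shown here as 8000), and model name must be adjusted to whatever your `run_server.sh` actually configures.
+
+ ```python
+ import os
+ from openai import OpenAI
+
+ # Assumed endpoint; replace host/port with the values configured in run_server.sh
+ base_url = f"http://{os.environ.get('LOCAL_IP', 'localhost')}:8000/v1"
+ client = OpenAI(base_url=base_url, api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="default",  # adjust if the server registers a different model name
+     messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
+     temperature=0.7,
+     top_p=0.8,
+     max_tokens=4096,
+ )
+ print(response.choices[0].message.content)
+ ```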
278
+
279
+
280
+ ### Quantized model deployment:
281
+
282
+ This section describes how to deploy quantized models with vLLM.
283
+
284
+ Image: use the same docker image as for BF16 deployment.
285
+
286
+
287
+ #### Int8 quantized model deployment:
288
+ To deploy the Int8-weight-only version of the HunYuan-A13B model, simply set the environment variable in `run_server_int8.sh`:
289
+ ```SHELL
290
+ export MODEL_PATH=PATH_TO_BF16_MODEL
291
+ ```
292
+
293
+ Then start the Int8 service by running:
294
+ ```shell
295
+ sh run_server_int8.sh
296
+ ```
297
+
298
+ After `run_server_int8.sh` runs successfully, run the request script:
299
+ ```shell
300
+ sh openapi.sh
301
+ ```
302
+
303
+ #### Int4 quantized model deployment:
304
+ To deploy the Int4-weight-only (GPTQ) version of the HunYuan-A13B model, simply set the environment variable in `run_server_int4.sh`:
305
+ ```SHELL
306
+ export MODEL_PATH=PATH_TO_INT4_MODEL
307
+ ```
308
+
309
+ Then start the Int4 service by running:
310
+ ```shell
311
+ sh run_server_int4.sh
312
+ ```
313
+
314
+ After `run_server_int4.sh` runs successfully, run the request script:
315
+ ```shell
316
+ sh openapi.sh
317
+ ```
318
+
319
+ #### FP8 quantized model deployment:
320
+ To deploy the W8A8C8 (FP8) version of the HunYuan-A13B model, simply set the environment variable in `run_server_fp8.sh`:
321
+ ```shell
322
+ export MODEL_PATH=PATH_TO_FP8_MODEL
323
+ ```
324
+
325
+ Then start the FP8 service by running:
326
+ ```shell
327
+ sh run_server_fp8.sh
328
+ ```
329
+
330
+ After `run_server_fp8.sh` runs successfully, run the request script:
331
+ ```shell
332
+ sh openapi.sh
333
+ ```
334
+
335
+ ### Performance evaluation:
336
+
337
+ This section reports throughput results for deploying each model (original and quantized) with vLLM, measured as inference speed (tokens/s) at different batch sizes. Test environment: Tencent Cloud, H80 (96G) GPUs x number of GPUs:
338
+
339
+ Benchmark command:
340
+ ```shell
341
+ python3 benchmark_throughput.py --backend vllm \
342
+ --input-len 2048 \
343
+ --output-len 14336 \
344
+ --model $MODEL_PATH \
345
+ --tensor-parallel-size $TP \
346
+ --use-v2-block-manager \
347
+ --async-engine \
348
+ --trust-remote-code \
349
+ --num_prompts $BATCH_SIZE \
350
+ --max-num-seqs $BATCH_SIZE
351
+ ```
352
+
353
+ | Inference Framework | Model | GPUs | input_length | batch=1 | batch=16 | batch=32 |
354
+ |------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------|
355
+ | vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 |
356
+ | vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 |
357
+ | vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 |
358
+ | vLLM | Hunyuan-A13B-Instruct(int8 weight only) | 2 | 2048 | 109.10 | 444.17 | 721.93 |
359
+ | vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 |
360
+ | vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 |
361
+
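+ For a rough comparison, the hedged sketch below reuses the 2-GPU numbers from the table above (no new measurements) to compute how the Int8-weight-only deployment compares with BF16 at each batch size:
+
+ ```python
+ # Throughput values (tokens/s) copied from the table above; both rows are 2-GPU deployments.
+ bf16_2gpu = {1: 111.72, 16: 327.31, 32: 346.54}
+ int8_2gpu = {1: 109.10, 16: 444.17, 32: 721.93}
+
+ for batch, bf16 in bf16_2gpu.items():
+     ratio = int8_2gpu[batch] / bf16
+     print(f"batch={batch}: int8 weight-only reaches {ratio:.2f}x the BF16 throughput")
+
+ # Approximate output: 0.98x, 1.36x, 2.08x -- in this test the weight-only
+ # quantization pays off mainly at the larger batch sizes.
+ ```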
362
+
363
+ ## Inference with sglang
364
+
365
+ ### BF16 deployment
366
+
367
+ #### Step 1: Run inference
368
+
369
+ #### Option 1: Command-line inference
370
+
371
+ Below is a code snippet that uses `sglang` to quickly query the chat model:
372
+
373
+
374
+ ```python
375
+ import os
+ import sglang as sgl
376
+ from transformers import AutoTokenizer
377
+
378
+ model_path=os.environ.get('MODEL_PATH')
379
+
380
+
381
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
382
+
383
+ messages = [
384
+ {
385
+ "role": "system",
386
+ "content": "You are a helpful assistant.",
387
+ },
388
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
389
+ ]
390
+ prompts = []
391
+ prompts.append(tokenizer.apply_chat_template(
392
+ messages,
393
+ tokenize=False,
394
+ add_generation_prompt=True
395
+ ))
396
+ print(prompts)
397
+
398
+ llm = sgl.Engine(
399
+ model_path=model_path,
400
+ tp_size=4,
401
+ trust_remote_code=True,
402
+ mem_fraction_static=0.7,
403
+ )
404
+
405
+ sampling_params = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_new_tokens": 4096}
406
+ outputs = llm.generate(prompts, sampling_params)
407
+ for prompt, output in zip(prompts, outputs):
408
+ print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
409
+ ```
410
+
411
+ #### Option 2: Serving
412
+
413
+ Below we show how to deploy the model as a service with `sglang` and send requests.
414
+
415
+ ```shell
416
+ model_path="<path to the HunyuanLLM model>"
417
+ python3 -u -m sglang.launch_server \
418
+ --model-path $model_path \
419
+ --tp 4 \
420
+ --trust-remote-code
421
+ ```
422
+
423
+ After the service starts successfully, run the request script:
424
+ ```python
425
+ import openai
426
+ client = openai.Client(
427
+ base_url="http://localhost:30000/v1", api_key="EMPTY")
428
+
429
+ response = client.chat.completions.create(
430
+ model="default",
431
+ messages= [
432
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
433
+ ],
434
+ temperature=0.7,
435
+ max_tokens=4096,
436
+ extra_body={"top_p": 0.8, "top_k": 20}
437
+ )
438
+ print(response)
439
+ ```
440
+
441
+ #### FP8/Int4 quantized model deployment:
442
+ FP8 and Int4 quantized models for sglang are still being worked on; stay tuned.
443
+
444
+ ## Interactive Web Demo
445
+ hunyuan-A13B now has a public web demo. Visit https://hunyuan.tencent.com/?model=hunyuan-a13b to try the model.
446
+
447
+ <br>
448
+
449
+ ## Citation
450
+ If you find our work helpful, please cite our <a href="report/Hunyuan_A13B_Technical_Report.pdf">technical report</a>!
451
+
452
+ <br>
453
+
454
+
455
+ ## Contact Us
456
+ If you would like to leave a message for our R&D or product teams, you are welcome to contact the Tencent Hunyuan LLM team by email ([email protected]).
config.json ADDED
@@ -0,0 +1,203 @@
1
+ {
2
+ "add_classification_head": false,
3
+ "anyres_pooling_size": 2,
4
+ "anyres_vit_max_image_size": null,
5
+ "anyres_vit_two_views": false,
6
+ "architectures": [
7
+ "HunYuanMoEV1ForCausalLM"
8
+ ],
9
+ "attention_bias": false,
10
+ "attention_dropout": 0.1,
11
+ "attention_head_dim": 128,
12
+ "auto_map": {
13
+ "AutoConfig": "configuration_hunyuan.HunYuanConfig",
14
+ "AutoModel": "hunyuan.HunYuanModel",
15
+ "AutoModelForCausalLM": "hunyuan.HunYuanMoEV1ForCausalLM"
16
+ },
17
+ "bos_token_id": 1,
18
+ "cla_share_factor": 2,
19
+ "class_num": 0,
20
+ "dense_list": [
21
+ 4096,
22
+ 0
23
+ ],
24
+ "eod_token_id": 127967,
25
+ "eos_token_id": 127960,
26
+ "group_limited_greedy": false,
27
+ "hidden_act": "silu",
28
+ "hidden_size": 4096,
29
+ "im_end_id": 6,
30
+ "im_newline_id": 12,
31
+ "im_start_id": 5,
32
+ "image_token_id": 9,
33
+ "initializer_range": 0.02,
34
+ "intermediate_size": 3072,
35
+ "kv_lora_rank": null,
36
+ "mask_init_id": 13,
37
+ "max_position_embeddings": 32768,
38
+ "mlp_bias": false,
39
+ "model_type": "hunyuan",
40
+ "moe_drop_tokens": false,
41
+ "moe_intermediate_size": [
42
+ 3072,
43
+ 3072,
44
+ 3072,
45
+ 3072,
46
+ 3072,
47
+ 3072,
48
+ 3072,
49
+ 3072,
50
+ 3072,
51
+ 3072,
52
+ 3072,
53
+ 3072,
54
+ 3072,
55
+ 3072,
56
+ 3072,
57
+ 3072,
58
+ 3072,
59
+ 3072,
60
+ 3072,
61
+ 3072,
62
+ 3072,
63
+ 3072,
64
+ 3072,
65
+ 3072,
66
+ 3072,
67
+ 3072,
68
+ 3072,
69
+ 3072,
70
+ 3072,
71
+ 3072,
72
+ 3072,
73
+ 3072
74
+ ],
75
+ "moe_layer_num_skipped": 0,
76
+ "moe_random_routing_dropped_token": false,
77
+ "moe_topk": [
78
+ 8,
79
+ 8,
80
+ 8,
81
+ 8,
82
+ 8,
83
+ 8,
84
+ 8,
85
+ 8,
86
+ 8,
87
+ 8,
88
+ 8,
89
+ 8,
90
+ 8,
91
+ 8,
92
+ 8,
93
+ 8,
94
+ 8,
95
+ 8,
96
+ 8,
97
+ 8,
98
+ 8,
99
+ 8,
100
+ 8,
101
+ 8,
102
+ 8,
103
+ 8,
104
+ 8,
105
+ 8,
106
+ 8,
107
+ 8,
108
+ 8,
109
+ 8
110
+ ],
111
+ "n_group": null,
112
+ "norm_topk_prob": true,
113
+ "norm_type": "rms",
114
+ "num_attention_heads": 32,
115
+ "num_experts": 64,
116
+ "num_hidden_layers": 32,
117
+ "num_key_value_heads": 8,
118
+ "num_media_embeds": 257,
119
+ "num_shared_expert": [
120
+ 1,
121
+ 1,
122
+ 1,
123
+ 1,
124
+ 1,
125
+ 1,
126
+ 1,
127
+ 1,
128
+ 1,
129
+ 1,
130
+ 1,
131
+ 1,
132
+ 1,
133
+ 1,
134
+ 1,
135
+ 1,
136
+ 1,
137
+ 1,
138
+ 1,
139
+ 1,
140
+ 1,
141
+ 1,
142
+ 1,
143
+ 1,
144
+ 1,
145
+ 1,
146
+ 1,
147
+ 1,
148
+ 1,
149
+ 1,
150
+ 1,
151
+ 1
152
+ ],
153
+ "org_vocab_size": 128167,
154
+ "pad_id": 127961,
155
+ "pad_token_id": 127961,
156
+ "pool_type": "last",
157
+ "position_embedding_xdrope": false,
158
+ "pretraining_tp": 1,
159
+ "q_lora_rank": null,
160
+ "qk_nope_head_dim": null,
161
+ "qk_rope_head_dim": null,
162
+ "rms_norm_eps": 1e-05,
163
+ "rope_scaling": {
164
+ "alpha": 1000.0,
165
+ "beta_fast": 32,
166
+ "beta_slow": 1,
167
+ "factor": 1.0,
168
+ "mscale": 1.0,
169
+ "mscale_all_dim": 1.0,
170
+ "type": "dynamic"
171
+ },
172
+ "rope_theta": 10000.0,
173
+ "routed_scaling_factor": 1.0,
174
+ "sep_token_id": 127962,
175
+ "skip_cls_token": false,
176
+ "text_end_id": 8,
177
+ "text_start_id": 7,
178
+ "tie_word_embeddings": true,
179
+ "topk_group": null,
180
+ "torch_dtype": "bfloat16",
181
+ "transformers_version": "4.53.0",
182
+ "use_cache": true,
183
+ "use_cla": false,
184
+ "use_mixed_mlp_moe": true,
185
+ "use_mla": false,
186
+ "use_qk_norm": true,
187
+ "use_rotary_pos_emb": true,
188
+ "v_head_dim": null,
189
+ "video_end_id": 11,
190
+ "video_start_id": 10,
191
+ "vit_add_patchemb_bias": false,
192
+ "vit_input_resolution": 224,
193
+ "vit_mapping_type": "resampler",
194
+ "vit_norm_type": "fused",
195
+ "vit_patch": 1,
196
+ "vit_path": null,
197
+ "vit_remove_prenorm": false,
198
+ "vit_token": 64,
199
+ "vit_type": null,
200
+ "vit_used_rms_norm": false,
201
+ "vocab_size": 128167,
202
+ "xdrope_section": null
203
+ }
configuration_hunyuan.py ADDED
@@ -0,0 +1,319 @@
1
+ # coding=utf-8
2
+ # Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
3
+ """ HunYuan model configuration"""
4
+ from torch import nn
5
+ from transformers.configuration_utils import PretrainedConfig
6
+ from transformers.utils import logging
7
+ from typing import List, Union, Optional
8
+
9
+
10
+ logger = logging.get_logger(__name__)
11
+
12
+
13
+ class HunYuanConfig(PretrainedConfig):
14
+ r"""
15
+ This is the configuration class to store the configuration of a [`HunYuanModel`]. It is used to instantiate an
16
+ HunYuan model according to the specified arguments, defining the model architecture. Instantiating a configuration
17
+ with the defaults will yield a similar configuration to that of the HunYuan-7B.
18
+
19
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
20
+ documentation from [`PretrainedConfig`] for more information.
21
+
22
+
23
+ Args:
24
+ vocab_size (`int`, *optional*, defaults to 32000):
25
+ Vocabulary size of the HunYuan model. Defines the number of different tokens that can be represented by the
26
+ `inputs_ids` passed when calling [`HunYuanModel`]
27
+ hidden_size (`int`, *optional*, defaults to 4096):
28
+ Dimension of the hidden representations.
29
+ intermediate_size (`int`, *optional*, defaults to 11008):
30
+ Dimension of the MLP representations or shared MLP representations.
31
+ moe_intermediate_size (`int` or `List`, *optional*, defaults to 11008):
32
+ Dimension of the MLP representations in MoE. Use a list if you want a different size per layer.
33
+ num_hidden_layers (`int`, *optional*, defaults to 32):
34
+ Number of hidden layers in the Transformer decoder.
35
+ num_attention_heads (`int`, *optional*, defaults to 32):
36
+ Number of attention heads for each attention layer in the Transformer decoder.
37
+ num_key_value_heads (`int`, *optional*):
38
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
39
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
40
+ `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
41
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
42
+ by meanpooling all the original heads within that group. For more details checkout [this
43
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
44
+ `num_attention_heads`.
45
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
46
+ The non-linear activation function (function or string) in the decoder.
47
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
48
+ The maximum sequence length that this model might ever be used with.
49
+ initializer_range (`float`, *optional*, defaults to 0.02):
50
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
51
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
52
+ The epsilon used by the rms normalization layers.
53
+ use_cache (`bool`, *optional*, defaults to `True`):
54
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
55
+ relevant if `config.is_decoder=True`.
56
+ pad_token_id (`int`, *optional*):
57
+ Padding token id.
58
+ bos_token_id (`int`, *optional*, defaults to 1):
59
+ Beginning of stream token id.
60
+ eos_token_id (`int`, *optional*, defaults to 2):
61
+ End of stream token id.
62
+ pretraining_tp (`int`, *optional*, defaults to 1):
63
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
64
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
65
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
66
+ issue](https://github.com/pytorch/pytorch/issues/76232).
67
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
68
+ Whether to tie weight embeddings
69
+ rope_theta (`float`, *optional*, defaults to 10000.0):
70
+ The base period of the RoPE embeddings.
71
+ rope_scaling (`Dict`, *optional*):
72
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
73
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
74
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
75
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
76
+ these scaling strategies behave:
77
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
78
+ experimental feature, subject to breaking API changes in future versions.
79
+ attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
80
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
81
+ attention_dropout (`float`, *optional*, defaults to 0.0):
82
+ The dropout ratio for the attention probabilities.
83
+ use_qk_norm (`bool`, *optional*, defaults to `False`):
84
+ Whether query and key in attention use norm
85
+ use_cla (`bool`, *optional*, defaults to `False`):
86
+ Whether to use CLA in attention
87
+ cla_share_factor (`int`, *optional*, defaults to 1):
88
+ The share factor of CLA
89
+ num_experts (`int` or `List`, *optional*, defaults to 1):
90
+ The number of experts for moe. If it is a list, it will be used as the number of experts for each layer.
91
+ num_shared_expert (`int` or `List`, *optional*, defaults to 1):
92
+ The number of shared experts for moe. If it is a list, it will be used as the number of shared experts for each layer.
93
+ moe_topk (`int` or `List`, *optional*, defaults to 1):
94
+ The topk value for moe. If it is a list, it will be used as the topk value for each layer.
95
+ capacity_factor (Not used) (`float` or `List`, *optional*, defaults to 1.0):
96
+ The capacity factor for moe. If it is a list, it will be used as the capacity factor for each layer.
97
+ moe_layer_num_skipped (`int`, *optional*, defaults to 0):
98
+ First moe_layer_num_skipped layers do not use MoE.
99
+ """
100
+
101
+ model_type = "hunyuan"
102
+ keys_to_ignore_at_inference = ["past_key_values"]
103
+
104
+ def __init__(
105
+ self,
106
+ vocab_size=290943,
107
+ org_vocab_size=290943,
108
+ hidden_size=4096,
109
+ intermediate_size: int=11008,
110
+ moe_intermediate_size: Union[int, List]=None,
111
+ num_hidden_layers=32,
112
+ num_attention_heads=32,
113
+ num_key_value_heads=None,
114
+ attention_head_dim=None,
115
+ hidden_act="silu",
116
+ max_position_embeddings=2048,
117
+ initializer_range=0.02,
118
+ rms_norm_eps=1e-5,
119
+ use_cache=True,
120
+ pad_token_id=0,
121
+ bos_token_id=1,
122
+ eos_token_id=2,
123
+ eod_token_id=3,
124
+ sep_token_id=4,
125
+ im_start_id=5,
126
+ im_end_id=6,
127
+ text_start_id=7,
128
+ text_end_id=8,
129
+ image_token_id=9,
130
+ video_start_id=10,
131
+ video_end_id=11,
132
+ im_newline_id=12,
133
+ mask_init_id=13,
134
+ pretraining_tp=1,
135
+ tie_word_embeddings=False,
136
+ rope_theta=10000.0,
137
+ rope_scaling=None,
138
+ attention_bias=False,
139
+ mlp_bias=False,
140
+ attention_dropout=0.0,
141
+ use_qk_norm=False,
142
+ use_rotary_pos_emb=True,
143
+ use_cla=False,
144
+ cla_share_factor=1,
145
+ norm_type="hf_rms",
146
+ num_experts: Union[int, List]=1,
147
+ use_mixed_mlp_moe=False,
148
+ num_shared_expert: Union[int, List]=1,
149
+ moe_topk: Union[int, List]=1,
150
+ # capacity_factor: Union[int, List]=1.0,
151
+ moe_drop_tokens=False,
152
+ moe_random_routing_dropped_token=False,
153
+ use_mla=False,
154
+ kv_lora_rank=512,
155
+ q_lora_rank=1536,
156
+ qk_rope_head_dim=64,
157
+ v_head_dim=128,
158
+ qk_nope_head_dim=128,
159
+ moe_layer_num_skipped=0,
160
+ norm_topk_prob=True,
161
+ routed_scaling_factor=1.0,
162
+ group_limited_greedy=False,
163
+ n_group=None,
164
+ topk_group=None,
165
+ vit_path=None,
166
+ num_media_embeds=257,
167
+ vit_type="AnyResVit",
168
+ vit_input_resolution=224,
169
+ vit_token=64,
170
+ vit_patch=1,
171
+ vit_mapping_type="simple_conv_mlp",
172
+ vit_norm_type="fused",
173
+ vit_used_rms_norm=True,
174
+ vit_remove_prenorm=True,
175
+ vit_add_patchemb_bias=True,
176
+ anyres_vit_max_image_size=2048,
177
+ anyres_pooling_size=2,
178
+ anyres_vit_two_views=False,
179
+ skip_cls_token=False,
180
+ position_embedding_xdrope=False,
181
+ xdrope_section=None,
182
+ add_classification_head=False,
183
+ class_num=0,
184
+ pool_type="last",
185
+ pad_id=-1,
186
+ **kwargs,
187
+ ):
188
+ self.vocab_size = vocab_size
189
+ self.org_vocab_size = org_vocab_size
190
+ self.max_position_embeddings = max_position_embeddings
191
+ self.hidden_size = hidden_size
192
+ self.intermediate_size = intermediate_size
193
+ self.moe_intermediate_size = moe_intermediate_size
194
+ self.num_hidden_layers = num_hidden_layers
195
+ self.num_attention_heads = num_attention_heads
196
+ self.num_experts = num_experts
197
+ self.use_mixed_mlp_moe = use_mixed_mlp_moe
198
+ self.num_shared_expert = num_shared_expert
199
+ self.moe_topk = moe_topk
200
+ # self.capacity_factor = capacity_factor
201
+ self.moe_drop_tokens = moe_drop_tokens
202
+ self.moe_random_routing_dropped_token = moe_random_routing_dropped_token
203
+
204
+ if attention_head_dim is not None:
205
+ self.attention_head_dim = attention_head_dim
206
+ else:
207
+ self.attention_head_dim = self.hidden_size // num_attention_heads
208
+
209
+ # for backward compatibility
210
+ if num_key_value_heads is None:
211
+ num_key_value_heads = num_attention_heads
212
+
213
+ self.num_key_value_heads = num_key_value_heads
214
+ self.hidden_act = hidden_act
215
+ self.initializer_range = initializer_range
216
+ self.rms_norm_eps = rms_norm_eps
217
+ self.pretraining_tp = pretraining_tp
218
+ self.use_cache = use_cache
219
+ self.rope_theta = rope_theta
220
+ self.rope_scaling = rope_scaling
221
+ # self._rope_scaling_validation() # TODO: Need validation?
222
+ self.attention_bias = attention_bias
223
+ self.mlp_bias = mlp_bias
224
+ self.attention_dropout = attention_dropout
225
+ self.use_qk_norm = use_qk_norm
226
+ self.use_rotary_pos_emb = use_rotary_pos_emb
227
+ self.use_cla = use_cla
228
+ self.cla_share_factor = cla_share_factor
229
+ self.norm_type = norm_type
230
+ # MLA args
231
+ self.use_mla = use_mla
232
+ self.kv_lora_rank = kv_lora_rank
233
+ self.q_lora_rank = q_lora_rank
234
+ self.qk_rope_head_dim = qk_rope_head_dim
235
+ self.qk_nope_head_dim = qk_nope_head_dim
236
+ self.v_head_dim = v_head_dim
237
+
238
+ # DeepSeek related args
239
+ self.moe_layer_num_skipped = moe_layer_num_skipped
240
+ self.norm_topk_prob = norm_topk_prob
241
+ self.routed_scaling_factor = routed_scaling_factor
242
+ self.group_limited_greedy = group_limited_greedy
243
+ self.n_group = n_group
244
+ self.topk_group = topk_group
245
+ self.add_classification_head = add_classification_head
246
+ self.class_num = class_num
247
+ self.pool_type = pool_type
248
+ self.pad_id = pad_id
249
+
250
+ if self.class_num is not None:
251
+ self.dense_list = [self.hidden_size, self.class_num]
252
+
253
+ # Vit args
254
+ self.vit_path = vit_path
255
+ self.num_media_embeds = num_media_embeds
256
+ self.vit_type = vit_type
257
+ self.vit_input_resolution = vit_input_resolution
258
+ self.vit_token = vit_token
259
+ self.vit_patch = vit_patch
260
+ self.vit_mapping_type = vit_mapping_type
261
+ self.vit_norm_type = vit_norm_type
262
+ self.vit_used_rms_norm = vit_used_rms_norm
263
+ self.vit_remove_prenorm = vit_remove_prenorm
264
+ self.vit_add_patchemb_bias = vit_add_patchemb_bias
265
+ self.anyres_vit_max_image_size = anyres_vit_max_image_size
266
+ self.anyres_pooling_size = anyres_pooling_size
267
+ self.anyres_vit_two_views = anyres_vit_two_views
268
+ self.skip_cls_token = skip_cls_token
269
+ self.position_embedding_xdrope = position_embedding_xdrope
270
+ self.xdrope_section = xdrope_section
271
+
272
+ # token id
273
+ self.eod_token_id = eod_token_id
274
+ self.im_start_id = im_start_id
275
+ self.im_end_id = im_end_id
276
+ self.text_start_id = text_start_id
277
+ self.text_end_id = text_end_id
278
+ self.image_token_id = image_token_id
279
+ self.video_start_id = video_start_id
280
+ self.video_end_id = video_end_id
281
+ self.im_newline_id = im_newline_id
282
+ self.mask_init_id = mask_init_id
283
+
284
+ super().__init__(
285
+ pad_token_id=pad_token_id,
286
+ bos_token_id=bos_token_id,
287
+ eos_token_id=eos_token_id,
288
+ sep_token_id=sep_token_id,
289
+ tie_word_embeddings=tie_word_embeddings,
290
+ **kwargs,
291
+ )
292
+
293
+ def _rope_scaling_validation(self):
294
+ """
295
+ Validate the `rope_scaling` configuration.
296
+ """
297
+ if self.rope_scaling is None:
298
+ return
299
+
300
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
301
+ raise ValueError(
302
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor` or `type` and `alpha`, "
303
+ f"got {self.rope_scaling}"
304
+ )
305
+ rope_scaling_type = self.rope_scaling.get("type", None)
306
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
307
+ rope_scaling_alpha = self.rope_scaling.get("alpha", None)
308
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
309
+ raise ValueError(
310
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
311
+ )
312
+ if rope_scaling_factor is None and rope_scaling_alpha is None:
313
+ raise ValueError("`rope_scaling` must have either a `factor` or an `alpha` field; got neither")
314
+ if rope_scaling_factor is not None:
315
+ if not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
316
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1.0, got {rope_scaling_factor}")
317
+ if rope_scaling_alpha is not None:
318
+ if not isinstance(rope_scaling_alpha, float) or rope_scaling_alpha <= 1.0:
319
+ raise ValueError(f"`rope_scaling`'s alpha field must be a float > 1.0, got {rope_scaling_alpha}")
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "do_sample": true,
3
+ "eos_token_id": [
4
+ 127960,
5
+ 127967
6
+ ],
7
+ "pad_token_id": 127961,
8
+ "repetition_penalty": 1.05,
9
+ "temperature": 0.7,
10
+ "top_k": 20,
11
+ "top_p": 0.8,
12
+ "transformers_version": "4.53.0"
13
+ }
hunyuan.py ADDED
@@ -0,0 +1,851 @@
1
+ # coding=utf-8
2
+ # Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
3
+ #
4
+ """ PyTorch HunYuan model."""
5
+
6
+ import math
7
+ import warnings
8
+ from typing import List, Optional, Tuple, Union
9
+
10
+ import torch
11
+ from torch import Tensor
12
+ import torch.nn.functional as F
13
+ import torch.utils.checkpoint
14
+ from torch import nn
15
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
16
+
17
+ from transformers.activations import ACT2FN
18
+ from transformers.cache_utils import Cache, DynamicCache
19
+ from transformers.modeling_attn_mask_utils import (
20
+ AttentionMaskConverter,
21
+ _prepare_4d_attention_mask,
22
+ _prepare_4d_causal_attention_mask,
23
+ _prepare_4d_causal_attention_mask_for_sdpa,
24
+ )
25
+ from transformers.modeling_outputs import (
26
+ BaseModelOutputWithPast,
27
+ CausalLMOutputWithPast,
28
+ SequenceClassifierOutputWithPast
29
+ )
30
+ from transformers.modeling_utils import PreTrainedModel
31
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
32
+ from transformers.utils import (
33
+ add_start_docstrings,
34
+ add_start_docstrings_to_model_forward,
35
+ is_flash_attn_2_available,
36
+ is_flash_attn_greater_or_equal_2_10,
37
+ logging,
38
+ replace_return_docstrings,
39
+ )
40
+ from transformers.utils.import_utils import is_torch_fx_available
41
+ from transformers.generation.utils import GenerateOutput
42
+ from .configuration_hunyuan import HunYuanConfig
43
+ from .modeling_hunyuan import HunYuanDecoderLayer, HunYuanRMSNorm
44
+
45
+
46
+ if is_flash_attn_2_available():
47
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
48
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
49
+
50
+
51
+ # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
52
+ # It means that the function will not be traced through and simply appear as a node in the graph.
53
+ if is_torch_fx_available():
54
+ if not is_torch_greater_or_equal_than_1_13:
55
+ import torch.fx
56
+
57
+ _prepare_4d_causal_attention_mask = torch.fx.wrap(_prepare_4d_causal_attention_mask)
58
+
59
+ logger = logging.get_logger(__name__)
60
+
61
+ _CONFIG_FOR_DOC = "HunYuanConfig"
62
+
63
+
64
+ HUNYUAN_START_DOCSTRING = r"""
65
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
66
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
67
+ etc.)
68
+
69
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
70
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
71
+ and behavior.
72
+
73
+ Parameters:
74
+ config ([`HunYuanConfig`]):
75
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
76
+ load the weights associated with the model, only the configuration. Check out the
77
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
78
+ """
79
+
80
+
81
+ @add_start_docstrings(
82
+ "The bare HunYuan Model outputting raw hidden-states without any specific head on top.",
83
+ HUNYUAN_START_DOCSTRING,
84
+ )
85
+ class HunYuanPreTrainedModel(PreTrainedModel):
86
+ config_class = HunYuanConfig
87
+ base_model_prefix = "model"
88
+ supports_gradient_checkpointing = True
89
+ _no_split_modules = ["HunYuanDecoderLayer"]
90
+ _skip_keys_device_placement = "past_key_values"
91
+ _supports_flash_attn_2 = True
92
+ _supports_sdpa = True
93
+ _supports_cache_class = True
94
+
95
+ def _init_weights(self, module):
96
+ std = self.config.initializer_range
97
+ if isinstance(module, nn.Linear):
98
+ module.weight.data.normal_(mean=0.0, std=std)
99
+ if module.bias is not None:
100
+ module.bias.data.zero_()
101
+ elif isinstance(module, nn.Embedding):
102
+ module.weight.data.normal_(mean=0.0, std=std)
103
+ if module.padding_idx is not None:
104
+ module.weight.data[module.padding_idx].zero_()
105
+
106
+
107
+ HUNYUAN_INPUTS_DOCSTRING = r"""
108
+ Args:
109
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
110
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
111
+ it.
112
+
113
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
114
+ [`PreTrainedTokenizer.__call__`] for details.
115
+
116
+ [What are input IDs?](../glossary#input-ids)
117
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
118
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
119
+
120
+ - 1 for tokens that are **not masked**,
121
+ - 0 for tokens that are **masked**.
122
+
123
+ [What are attention masks?](../glossary#attention-mask)
124
+
125
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
126
+ [`PreTrainedTokenizer.__call__`] for details.
127
+
128
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
129
+ `past_key_values`).
130
+
131
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
132
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
133
+ information on the default strategy.
134
+
135
+ - 1 indicates the head is **not masked**,
136
+ - 0 indicates the head is **masked**.
137
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
138
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
139
+ config.n_positions - 1]`.
140
+
141
+ [What are position IDs?](../glossary#position-ids)
142
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
143
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
144
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
145
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
146
+
147
+ Two formats are allowed:
148
+ - a [`~cache_utils.Cache`] instance;
149
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
150
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
151
+ cache format.
152
+
153
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
154
+ legacy cache format will be returned.
155
+
156
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
157
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
158
+ of shape `(batch_size, sequence_length)`.
159
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
160
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
161
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
162
+ model's internal embedding lookup matrix.
163
+ use_cache (`bool`, *optional*):
164
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
165
+ `past_key_values`).
166
+ output_attentions (`bool`, *optional*):
167
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
168
+ tensors for more detail.
169
+ output_hidden_states (`bool`, *optional*):
170
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
171
+ more detail.
172
+ return_dict (`bool`, *optional*):
173
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
174
+ """
175
+
176
+
177
+ @add_start_docstrings(
178
+ "The bare HunYuan Model outputting raw hidden-states without any specific head on top.",
179
+ HUNYUAN_START_DOCSTRING,
180
+ )
181
+ class HunYuanModel(HunYuanPreTrainedModel):
182
+ """
183
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`HunYuanDecoderLayer`]
184
+
185
+ Args:
186
+ config: HunYuanConfig
187
+ """
188
+
189
+ def __init__(self, config: HunYuanConfig):
190
+ super().__init__(config)
191
+ self.padding_idx = config.pad_token_id
192
+ self.vocab_size = config.vocab_size
193
+ self.add_classification_head = config.add_classification_head
194
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
195
+ self.layers = nn.ModuleList(
196
+ [HunYuanDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
197
+ )
198
+ self._use_sdpa = config._attn_implementation == "sdpa"
199
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
200
+ if not config.add_classification_head:
201
+ self.norm = HunYuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
202
+
203
+ self.cla = config.use_cla
204
+ self.cla_share_factor = config.cla_share_factor
205
+
206
+ self.gradient_checkpointing = False
207
+ # Initialize weights and apply final processing
208
+ self.post_init()
209
+
210
+ def get_input_embeddings(self):
211
+ return self.embed_tokens
212
+
213
+ def set_input_embeddings(self, value):
214
+ self.embed_tokens = value
215
+
216
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
217
+ def forward(
218
+ self,
219
+ input_ids: torch.LongTensor = None,
220
+ attention_mask: Optional[torch.Tensor] = None,
221
+ position_ids: Optional[torch.LongTensor] = None,
222
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
223
+ inputs_embeds: Optional[torch.FloatTensor] = None,
224
+ use_cache: Optional[bool] = None,
225
+ output_attentions: Optional[bool] = None,
226
+ output_hidden_states: Optional[bool] = None,
227
+ return_dict: Optional[bool] = None,
228
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
229
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
230
+ output_hidden_states = (
231
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
232
+ )
233
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
234
+
235
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
236
+
237
+ # retrieve input_ids and inputs_embeds
238
+ # if input_ids is not None and inputs_embeds is not None:
239
+ # raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
240
+ if input_ids is not None:
241
+ batch_size, seq_length = input_ids.shape[:2]
242
+ elif inputs_embeds is not None:
243
+ batch_size, seq_length = inputs_embeds.shape[:2]
244
+ else:
245
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
246
+
247
+ if self.gradient_checkpointing and self.training:
248
+ if use_cache:
249
+ logger.warning_once(
250
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
251
+ )
252
+ use_cache = False
253
+
254
+ past_key_values_length = 0
255
+ if use_cache:
256
+ use_legacy_cache = not isinstance(past_key_values, Cache)
257
+ if use_legacy_cache:
258
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
259
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
260
+
261
+ if position_ids is None:
262
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
263
+ position_ids = torch.arange(
264
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
265
+ )
266
+ position_ids = position_ids.unsqueeze(0)
267
+
268
+ if inputs_embeds is None:
269
+ inputs_embeds = self.embed_tokens(input_ids)
270
+
271
+ # Fix lora with gradient checkpointing training
272
+ if self.training and inputs_embeds.is_leaf:
273
+ inputs_embeds.requires_grad = True
274
+
275
+ if self._use_flash_attention_2:
276
+ # 2d mask is passed through the layers
277
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
278
+ elif self._use_sdpa and not output_attentions:
279
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
280
+ # the manual implementation that requires a 4D causal mask in all cases.
281
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
282
+ attention_mask,
283
+ (batch_size, seq_length),
284
+ inputs_embeds,
285
+ past_key_values_length,
286
+ )
287
+ else:
288
+ # 4d mask is passed through the layers
289
+ attention_mask = _prepare_4d_causal_attention_mask(
290
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
291
+ )
292
+
293
+ # embed positions
294
+ hidden_states = inputs_embeds
295
+
296
+ # decoder layers
297
+ all_hidden_states = () if output_hidden_states else None
298
+ all_self_attns = () if output_attentions else None
299
+ next_decoder_cache = None
300
+
301
+ prev_kv_states = None
302
+ for layer_idx, decoder_layer in enumerate(self.layers):
303
+ if output_hidden_states:
304
+ all_hidden_states += (hidden_states,)
305
+
306
+ if self.gradient_checkpointing and self.training:
307
+ layer_outputs = self._gradient_checkpointing_func(
308
+ decoder_layer.__call__,
309
+ hidden_states,
310
+ attention_mask,
311
+ position_ids,
312
+ past_key_values,
313
+ output_attentions,
314
+ use_cache,
315
+ prev_kv_states,
316
+ )
317
+ else:
318
+ layer_outputs = decoder_layer(
319
+ hidden_states,
320
+ attention_mask=attention_mask,
321
+ position_ids=position_ids,
322
+ past_key_value=past_key_values,
323
+ output_attentions=output_attentions,
324
+ use_cache=use_cache,
325
+ kv_states=prev_kv_states
326
+ )
327
+
328
+ hidden_states = layer_outputs[0]
329
+
330
+ if use_cache:
331
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
332
+
333
+ if output_attentions:
334
+ all_self_attns += (layer_outputs[1],)
335
+
336
+ kv_states = layer_outputs[-1]
337
+
338
+ if self.cla and layer_idx % self.cla_share_factor == 0:
339
+ prev_kv_states = kv_states
340
+ if not self.add_classification_head:
341
+ hidden_states = self.norm(hidden_states)
342
+
343
+ # add hidden states from the last decoder layer
344
+ if output_hidden_states:
345
+ all_hidden_states += (hidden_states,)
346
+
347
+ next_cache = None
348
+ if use_cache:
349
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
350
+ if not return_dict:
351
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
352
+ return BaseModelOutputWithPast(
353
+ last_hidden_state=hidden_states,
354
+ past_key_values=next_cache,
355
+ hidden_states=all_hidden_states,
356
+ attentions=all_self_attns,
357
+ )
358
+
359
+
360
+ class HunYuanMoEV1ForCausalLM(HunYuanPreTrainedModel):
361
+ _tied_weights_keys = ["lm_head.weight"]
362
+
363
+ def __init__(self, config: HunYuanConfig):
364
+ super().__init__(config)
365
+
366
+ self.config = config
367
+ self.model = HunYuanModel(config)
368
+ self.add_classification_head = config.add_classification_head
369
+ self.pad_id = config.pad_id
370
+ self.vocab_size = config.vocab_size
371
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
372
+ if config.add_classification_head:
373
+ self.pool_head = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
374
+ self.pool_head2 = nn.Linear(config.hidden_size, config.class_num, bias=False)
375
+ # Initialize weights and apply final processing
376
+ self.post_init()
377
+
378
+ def get_input_embeddings(self):
379
+ return self.model.embed_tokens
380
+
381
+ def set_input_embeddings(self, value):
382
+ self.model.embed_tokens = value
383
+
384
+ def get_output_embeddings(self):
385
+ return self.lm_head
386
+
387
+ def set_output_embeddings(self, new_embeddings):
388
+ self.lm_head = new_embeddings
389
+
390
+ def set_decoder(self, decoder):
391
+ self.model = decoder
392
+
393
+ def get_decoder(self):
394
+ return self.model
395
+
396
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
397
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
398
+ def forward(
399
+ self,
400
+ input_ids: torch.LongTensor = None,
401
+ attention_mask: Optional[torch.Tensor] = None,
402
+ position_ids: Optional[torch.LongTensor] = None,
403
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
404
+ inputs_embeds: Optional[torch.FloatTensor] = None,
405
+ labels: Optional[torch.LongTensor] = None,
406
+ use_cache: Optional[bool] = None,
407
+ output_attentions: Optional[bool] = None,
408
+ output_hidden_states: Optional[bool] = None,
409
+ return_dict: Optional[bool] = None,
410
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
411
+ r"""
412
+ Args:
413
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
414
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
415
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
416
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
417
+
418
+ Returns:
419
+
420
+ Example:
421
+
422
+ ```python
423
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
424
+
425
+ >>> model = AutoModelForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
426
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
427
+
428
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
429
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
430
+
431
+ >>> # Generate
432
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
433
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
434
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
435
+ ```"""
436
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
437
+ output_hidden_states = (
438
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
439
+ )
440
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
441
+
442
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
443
+ outputs = self.model(
444
+ input_ids=input_ids,
445
+ attention_mask=attention_mask,
446
+ position_ids=position_ids,
447
+ past_key_values=past_key_values,
448
+ inputs_embeds=inputs_embeds,
449
+ use_cache=use_cache,
450
+ output_attentions=output_attentions,
451
+ output_hidden_states=output_hidden_states,
452
+ return_dict=return_dict,
453
+ )
454
+
455
+ hidden_states = outputs[0]
456
+
457
+ if not self.add_classification_head:
458
+ if self.config.pretraining_tp > 1:
459
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
460
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
461
+ logits = torch.cat(logits, dim=-1)
462
+ else:
463
+ logits = self.lm_head(hidden_states)
464
+ logits = logits.float()
465
+ else:
466
+ logits = hidden_states
467
+ logits = logits.float()
468
+ pooled_output = self.pool_head(logits)
469
+ pooled_output = torch.tanh(pooled_output)
470
+ pooled_output = self.pool_head2(pooled_output).contiguous() # bs * class_num
471
+ if len(pooled_output.shape) < 2:
472
+ raise ValueError("pooled_output does not have enough dimensions for transpose")
473
+
474
+ if self.config.pool_type == "mean":
475
+ reward = pooled_output.mean(dim=1).squeeze(-1)
476
+ elif self.config.pool_type == "last":
477
+ # bs * hidden_size
478
+ seq_length = (input_ids != self.pad_id).long().sum(dim=1) - 1
479
+ batch_size = input_ids.size(0)
480
+ reward = pooled_output[torch.arange(batch_size, device=pooled_output.device), seq_length].squeeze(-1)
481
+ else:
482
+ reward = pooled_output[:, 0].squeeze(-1)
483
+
484
+ loss = None
485
+ if labels is not None:
486
+ # Shift so that tokens < n predict n
487
+ shift_logits = logits[..., :-1, :].contiguous()
488
+ shift_labels = labels[..., 1:].contiguous()
489
+ # Flatten the tokens
490
+ loss_fct = CrossEntropyLoss()
491
+ shift_logits = shift_logits.reshape(-1, self.config.vocab_size)
492
+ shift_labels = shift_labels.reshape(-1)
493
+ # Enable model parallelism
494
+ shift_labels = shift_labels.to(shift_logits.device)
495
+ loss = loss_fct(shift_logits, shift_labels)
496
+
497
+ if not return_dict:
498
+ output = (logits,) + outputs[1:]
499
+ return (loss,) + output if loss is not None else output
500
+
501
+ output = CausalLMOutputWithPast(
502
+ loss=loss,
503
+ logits=logits,
504
+ past_key_values=outputs.past_key_values,
505
+ hidden_states=outputs.hidden_states,
506
+ attentions=outputs.attentions,
507
+ )
508
+ if self.add_classification_head:
509
+ output['reward'] = reward
510
+
511
+ return output
512
+
513
+ def prepare_inputs_for_generation(
514
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
515
+ ):
516
+ if past_key_values is not None:
517
+ if isinstance(past_key_values, Cache):
518
+ cache_length = past_key_values.get_seq_length()
519
+ past_length = past_key_values.seen_tokens
520
+ max_cache_length = past_key_values.get_max_cache_shape()
521
+ else:
522
+ cache_length = past_length = past_key_values[0][0].shape[2]
523
+ max_cache_length = None
524
+
525
+ # Keep only the unprocessed tokens:
526
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
527
+ # some of the inputs are exclusivelly passed as part of the cache (e.g. when passing input_embeds as
528
+ # input)
529
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
530
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
531
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
532
+ # input_ids based on the past_length.
533
+ elif past_length < input_ids.shape[1]:
534
+ input_ids = input_ids[:, past_length:]
535
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
536
+
537
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
538
+ if (
539
+ max_cache_length is not None
540
+ and attention_mask is not None
541
+ and cache_length + input_ids.shape[1] > max_cache_length
542
+ ):
543
+ attention_mask = attention_mask[:, -max_cache_length:]
544
+
545
+ position_ids = kwargs.get("position_ids", None)
546
+ if attention_mask is not None and position_ids is None:
547
+ # create position_ids on the fly for batch generation
548
+ position_ids = attention_mask.long().cumsum(-1) - 1
549
+ position_ids.masked_fill_(attention_mask == 0, 1)
550
+ if past_key_values:
551
+ position_ids = position_ids[:, -input_ids.shape[1]:]
552
+
553
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
554
+ if inputs_embeds is not None and past_key_values is None:
555
+ model_inputs = {"inputs_embeds": inputs_embeds}
556
+ else:
557
+ model_inputs = {"input_ids": input_ids}
558
+
559
+ model_inputs.update(
560
+ {
561
+ "position_ids": position_ids,
562
+ "past_key_values": past_key_values,
563
+ "use_cache": kwargs.get("use_cache"),
564
+ "attention_mask": attention_mask,
565
+ }
566
+ )
567
+ return model_inputs
568
+
569
+ @staticmethod
570
+ def _reorder_cache(past_key_values, beam_idx):
571
+ reordered_past = ()
572
+ for layer_past in past_key_values:
573
+ reordered_past += (
574
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
575
+ )
576
+ return reordered_past
577
+
578
+
579
+ class MultimodelHunYuanForCausalLM(HunYuanMoEV1ForCausalLM):
580
+ _tied_weights_keys = ["lm_head.weight"]
581
+
582
+ def __init__(self, config: HunYuanConfig):
583
+ super().__init__(config)
584
+
585
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
586
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
587
+ def forward(
588
+ self,
589
+ input_ids: torch.LongTensor = None,
590
+ attention_mask: Optional[torch.Tensor] = None,
591
+ position_ids: Optional[torch.LongTensor] = None,
592
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
593
+ inputs_embeds: Optional[torch.FloatTensor] = None,
594
+ labels: Optional[torch.LongTensor] = None,
595
+ imgs: Optional[List[torch.FloatTensor]] = None,
596
+ imgs_pos: Optional[List[int]] = None,
597
+ use_cache: Optional[bool] = None,
598
+ output_attentions: Optional[bool] = None,
599
+ output_hidden_states: Optional[bool] = None,
600
+ return_dict: Optional[bool] = None,
601
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
602
+ r"""
603
+ Args:
604
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
605
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
606
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
607
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
608
+
609
+ Returns:
610
+
611
+ Example:
612
+
613
+ ```python
614
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
615
+
616
+ >>> model = AutoModelForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
617
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
618
+
619
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
620
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
621
+
622
+ >>> # Generate
623
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
624
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
625
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
626
+ ```"""
627
+ mask_init_id = self.config.mask_init_id
628
+ pad_id = self.config.pad_token_id
629
+ eod_id = self.config.eod_token_id
630
+ image_token_id = self.config.image_token_id
631
+ im_start_id = self.config.im_start_id
632
+ im_end_id = self.config.im_end_id
633
+ video_start_id = self.config.video_start_id
634
+ video_end_id = self.config.video_end_id
635
+
636
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
637
+ output_hidden_states = (
638
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
639
+ )
640
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
641
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
642
+
643
+ outputs = self.model(
644
+ input_ids=input_ids,
645
+ attention_mask=attention_mask,
646
+ position_ids=position_ids,
647
+ past_key_values=past_key_values,
648
+ inputs_embeds=inputs_embeds,
649
+ use_cache=use_cache,
650
+ output_attentions=output_attentions,
651
+ output_hidden_states=output_hidden_states,
652
+ return_dict=return_dict,
653
+ )
654
+
655
+ hidden_states = outputs[0]
656
+ if self.config.pretraining_tp > 1:
657
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
658
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
659
+ logits = torch.cat(logits, dim=-1)
660
+ else:
661
+ logits = self.lm_head(hidden_states)
662
+ logits = logits.float()
663
+
664
+ loss = None
665
+ if labels is not None:
666
+ labels = labels.to(logits.device)
667
+ # NOTE: no shift is applied here; labels are expected to already be aligned with the logits
668
+ shift_logits = logits
669
+ shift_labels = labels
670
+ # Flatten the tokens
671
+ loss_fct = CrossEntropyLoss()
672
+ shift_logits = shift_logits.reshape(-1, self.config.vocab_size)
673
+ shift_labels = shift_labels.reshape(-1)
674
+ shift_tokens = input_ids.reshape(-1)
675
+ # compute loss
676
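+ # Only positions whose label is an ordinary text token (id below mask_init_id and not a pad / image / video marker) and whose input token is neither padding nor EOD contribute to the loss.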
+ mask = (shift_labels < mask_init_id) & (shift_labels != pad_id) & (shift_labels != image_token_id) & (shift_labels != im_start_id) \
677
+ & (shift_labels != im_end_id) & (shift_labels != video_start_id) & (shift_labels != video_end_id) & (shift_tokens != pad_id) & (shift_tokens != eod_id)
678
+ shift_logits = shift_logits[mask, :]
679
+ shift_labels = shift_labels[mask]
680
+ loss = loss_fct(shift_logits, shift_labels)
681
+
682
+ if not return_dict:
683
+ output = (logits,) + outputs[1:]
684
+ return (loss,) + output if loss is not None else output
685
+
686
+ return CausalLMOutputWithPast(
687
+ loss=loss,
688
+ logits=logits,
689
+ past_key_values=outputs.past_key_values,
690
+ hidden_states=outputs.hidden_states,
691
+ attentions=outputs.attentions,
692
+ )
693
+
694
+ def prepare_inputs_for_generation(
695
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
696
+ ):
697
+ imgs = kwargs.pop("imgs", None)
698
+ imgs_pos = kwargs.pop("imgs_pos", None)
699
+ inputs = super().prepare_inputs_for_generation(
700
+ input_ids, past_key_values=past_key_values, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs
701
+ )
702
+
703
+ if imgs is not None:
704
+ inputs['imgs'] = imgs
705
+ if imgs_pos is not None:
706
+ inputs['imgs_pos'] = imgs_pos
707
+ return inputs
708
+
709
+ @torch.no_grad()
710
+ def generate(
711
+ self,
712
+ inputs: Optional[torch.Tensor] = None,
713
+ attention_mask: Optional[torch.Tensor] = None,
714
+ position_ids: Optional[torch.LongTensor] = None,
715
+ imgs: Optional[List[torch.FloatTensor]] = None,
716
+ imgs_pos: Optional[List[int]] = None,
717
+ **kwargs,
718
+ ) -> Union[GenerateOutput, torch.LongTensor]:
719
+ if "inputs_embeds" in kwargs:
720
+ raise NotImplementedError("`inputs_embeds` is not supported")
721
+
722
+ return super().generate(
723
+ inputs=inputs,
724
+ position_ids=position_ids,
725
+ attention_mask=attention_mask,
726
+ # `inputs_embeds` is rejected above; forward the vision inputs instead so prepare_inputs_for_generation can pick them up
+ imgs=imgs,
+ imgs_pos=imgs_pos,
727
+ eos_token_id=self.config.eod_token_id,
728
+ **kwargs
729
+ )
730
+
731
+
732
+ @add_start_docstrings(
733
+ """
734
+ The HunYuan Model transformer with a sequence classification head on top (linear layer).
735
+
736
+ [`HunYuanForSequenceClassification`] uses the last token in order to do the classification, as other causal models
737
+ (e.g. GPT-2) do.
738
+
739
+ Since it does classification on the last token, it needs to know the position of the last token. If a
740
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
741
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
742
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
743
+ each row of the batch).
744
+ """,
745
+ HUNYUAN_START_DOCSTRING,
746
+ )
747
+ class HunYuanForSequenceClassification(HunYuanPreTrainedModel):
748
+ def __init__(self, config):
749
+ super().__init__(config)
750
+ self.num_labels = config.num_labels
751
+ self.model = HunYuanModel(config)
752
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
753
+
754
+ # Initialize weights and apply final processing
755
+ self.post_init()
756
+
757
+ def get_input_embeddings(self):
758
+ return self.model.embed_tokens
759
+
760
+ def set_input_embeddings(self, value):
761
+ self.model.embed_tokens = value
762
+
763
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
764
+ def forward(
765
+ self,
766
+ input_ids: torch.LongTensor = None,
767
+ attention_mask: Optional[torch.Tensor] = None,
768
+ position_ids: Optional[torch.LongTensor] = None,
769
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
770
+ inputs_embeds: Optional[torch.FloatTensor] = None,
771
+ labels: Optional[torch.LongTensor] = None,
772
+ use_cache: Optional[bool] = None,
773
+ output_attentions: Optional[bool] = None,
774
+ output_hidden_states: Optional[bool] = None,
775
+ return_dict: Optional[bool] = None,
776
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
777
+ r"""
778
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
779
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
780
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
781
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
782
+ """
783
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
784
+
785
+ transformer_outputs = self.model(
786
+ input_ids,
787
+ attention_mask=attention_mask,
788
+ position_ids=position_ids,
789
+ past_key_values=past_key_values,
790
+ inputs_embeds=inputs_embeds,
791
+ use_cache=use_cache,
792
+ output_attentions=output_attentions,
793
+ output_hidden_states=output_hidden_states,
794
+ return_dict=return_dict,
795
+ )
796
+ hidden_states = transformer_outputs[0]
797
+ logits = self.score(hidden_states)
798
+
799
+ if input_ids is not None:
800
+ batch_size = input_ids.shape[0]
801
+ else:
802
+ batch_size = inputs_embeds.shape[0]
803
+
804
+ if self.config.pad_token_id is None and batch_size != 1:
805
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
806
+ if self.config.pad_token_id is None:
807
+ sequence_lengths = -1
808
+ else:
809
+ if input_ids is not None:
810
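+ # argmax over (token == pad_token_id) gives the first padding position; subtracting 1 yields the last real token (or -1, i.e. the final position, when no padding is present).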
+ sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
811
+ logits.device
812
+ )
813
+ else:
814
+ sequence_lengths = -1
815
+
816
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
817
+
818
+ loss = None
819
+ if labels is not None:
820
+ labels = labels.to(logits.device)
821
+ if self.config.problem_type is None:
822
+ if self.num_labels == 1:
823
+ self.config.problem_type = "regression"
824
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
825
+ self.config.problem_type = "single_label_classification"
826
+ else:
827
+ self.config.problem_type = "multi_label_classification"
828
+
829
+ if self.config.problem_type == "regression":
830
+ loss_fct = MSELoss()
831
+ if self.num_labels == 1:
832
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
833
+ else:
834
+ loss = loss_fct(pooled_logits, labels)
835
+ elif self.config.problem_type == "single_label_classification":
836
+ loss_fct = CrossEntropyLoss()
837
+ loss = loss_fct(pooled_logits.reshape(-1, self.num_labels), labels.reshape(-1))
838
+ elif self.config.problem_type == "multi_label_classification":
839
+ loss_fct = BCEWithLogitsLoss()
840
+ loss = loss_fct(pooled_logits, labels)
841
+ if not return_dict:
842
+ output = (pooled_logits,) + transformer_outputs[1:]
843
+ return ((loss,) + output) if loss is not None else output
844
+
845
+ return SequenceClassifierOutputWithPast(
846
+ loss=loss,
847
+ logits=pooled_logits,
848
+ past_key_values=transformer_outputs.past_key_values,
849
+ hidden_states=transformer_outputs.hidden_states,
850
+ attentions=transformer_outputs.attentions,
851
+ )
hy.tiktoken ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3057ab04da86a1081276359ba6a8a57cdaafa4d411c13507b96f6cc333644fbd
3
+ size 4985270248
model-00002-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3018440dddf88261fec7b6df7b6ff504a1f3d8cc8d7caef4f98dd812369331ab
3
+ size 4992312432
model-00003-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3b9f3e4ff67a2f3c3c24bdf8ecac1333b879c86b6dd6b649458cc9d17b21a1d
3
+ size 4992312432
model-00004-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ad0065c21c67464110e41ce52b1246c74f8aebd44752009fd66f0b1ab315eceb
3
+ size 4992312432
model-00005-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ce245f6cf1e6991be4eaf0353c5204e0c44fc068363135f4476544e914b56c69
3
+ size 4992312432
model-00006-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a8bbd1ea7996be3c9a6a0da8b544e402acde048625307b0c5598c838b264ad75
3
+ size 4992312432
model-00007-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9aebaab82552d506cdff90212a1e06c51ca2ceeaab354b8d14301d23e7cb82a6
3
+ size 4992312432
model-00008-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aec67ab9212f8b8c7e0fc724b12f2085015e13747ba13da336d6bad44881c021
3
+ size 4992312432
model-00009-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:399a73e4f8426029d829755dc48edc0992c20aa575139fe3f13d1be9bf4f2934
3
+ size 4992312432
model-00010-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3d226a5cf2b8b789b7121ab063bc2db1ad07818889cf776cea7c064ab8bad58a
3
+ size 4992312432
model-00011-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4cc756f423124a3589bc4203bff1db6b8f8e303bf7e3a8ef24d347ab76fb50ae
3
+ size 4992312600
model-00012-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:48f52fab039ab7ae66229b9a7367b2b828f0c87b04d223252dd3bf7dabe085b7
3
+ size 4992312632
model-00013-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b0eb025d5fcec82c706972bd68f34249823f40c1b3effeacb7cc99438e6ee5b6
3
+ size 4992312632
model-00014-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:274e299f31c71a3292fe878ee3cab3485b2a3fab6dec9ef8bc907158ac30a271
3
+ size 4992312632
model-00015-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8a99f9737b1914b8c1ce27d50f9d40a645ff33d525b322d3162e88e7ba0a53a1
3
+ size 4992312632
model-00016-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea5ae9a35dc905d8f32388e2d1b0e3417910211dcc8b25f8ab0424f94507a672
3
+ size 4992312632
model-00017-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9be97c260719cd6da52d1e30bc34dd5b1844ac86640bc75d7284dc65f911019c
3
+ size 4992312632
model-00018-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:074594f627910e48d5d86aac81424f0acbc3e354c635afe01a20c3b8525a3647
3
+ size 4992312632
model-00019-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:65cfedeeed8f1d0c85f36eb8cca2653151a3c7597d574e96c7747aaf05b5c6eb
3
+ size 4992312632
model-00020-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8aaa1fb52cf9c9f588653222ec0b88ebe13f553a70f96aa8d50d013824d6d9d1
3
+ size 4992312632
model-00021-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b445fe65502467a3085397602f17223a38e71b66692af9cc644cfbe72b920d69
3
+ size 4992312632
model-00022-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:761851085e1732d60d2b1572b38f1ecdab2c8cba3656ee938e5af7133391d999
3
+ size 4992312632
model-00023-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1dadf06c900e764fcdedee6e82a53729c108742c108550f85bdf8e31f72570bd
3
+ size 4992312632
model-00024-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e5d2cedd5dfb913d1dcd1d7b577f13f5cb0f1ffc05044adc5e242589c9c35d63
3
+ size 4992312632
model-00025-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59ded8fd9582ce18e20af6aac5b7b6f6e98d1804d6e220fae669836c8ef733b7
3
+ size 4992312632
model-00026-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c50508a23df852b3c5997bad69d0419ec7ed3a5810f16b39857f69d78576131f
3
+ size 4992312632
model-00027-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f8f918076d44128dec6c078e5f2b94b567f6d5b9fcf2fddf7ad81e9a1bc4341
3
+ size 4992312632
model-00028-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e138c2158aa0a4a5087b3b99c31e553e676898acad1040abc42ba8b61035a015
3
+ size 4992312632
model-00029-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1efa53df7fbc9271f5cf99a6654972c50c630710efe45162b27d5cff288e37ee
3
+ size 4992312632
model-00030-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ffa461ecfe206120663525656840b0d5577cfa460fd3e7ff297d62f0e5dfb990
3
+ size 4992312632
model-00031-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:be3ab112fc526c7c087f671dfa893e217af729bc3fa55ceed9b3b51dd9470bc6
3
+ size 4992312632
model-00032-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:30ce68700e6981065d55e45c95524e3974028ea95af178885da42eebd3cb22a7
3
+ size 4992312632
model-00033-of-00033.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb33d5b889131b4b1864fe947b516fbde8e7fb6f60b45a12bc61a2d0bf09f4cb
3
+ size 1056994712
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_hunyuan.py ADDED
@@ -0,0 +1,1728 @@
1
+ # Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
2
+ #
3
+ # Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # https://github.com/Tencent/Tencent-Hunyuan-Large/blob/main/License.docx
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ #
15
+ """ PyTorch HunYuan model."""
16
+
17
+ import math
18
+ import warnings
19
+ from typing import List, Optional, Tuple, Union
20
+
21
+ import torch
22
+ from torch import Tensor
23
+ import torch.nn.functional as F
24
+ import torch.utils.checkpoint
25
+ from torch import nn
26
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
27
+
28
+ from transformers.activations import ACT2FN
29
+ from transformers.cache_utils import Cache, DynamicCache
30
+ from transformers.modeling_attn_mask_utils import (
31
+ AttentionMaskConverter,
32
+ _prepare_4d_attention_mask,
33
+ _prepare_4d_causal_attention_mask,
34
+ _prepare_4d_causal_attention_mask_for_sdpa,
35
+ )
36
+ from transformers.modeling_outputs import (
37
+ BaseModelOutputWithPast,
38
+ CausalLMOutputWithPast,
39
+ SequenceClassifierOutputWithPast
40
+ )
41
+ from transformers.modeling_utils import PreTrainedModel
42
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
43
+ from transformers.utils import (
44
+ add_start_docstrings,
45
+ add_start_docstrings_to_model_forward,
46
+ is_flash_attn_2_available,
47
+ is_flash_attn_greater_or_equal_2_10,
48
+ logging,
49
+ replace_return_docstrings,
50
+ )
51
+ from transformers.utils.import_utils import is_torch_fx_available
52
+ from .configuration_hunyuan import HunYuanConfig
53
+
54
+
55
+ if is_flash_attn_2_available():
56
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
57
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
58
+
59
+
60
+ # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
61
+ # It means that the function will not be traced through and simply appear as a node in the graph.
62
+ if is_torch_fx_available():
63
+ if not is_torch_greater_or_equal_than_1_13:
64
+ import torch.fx
65
+
66
+ _prepare_4d_causal_attention_mask = torch.fx.wrap(_prepare_4d_causal_attention_mask)
67
+
68
+
69
+ logger = logging.get_logger(__name__)
70
+
71
+ _CONFIG_FOR_DOC = "HunYuanConfig"
72
+
73
+
74
+ def topkgating(logits: Tensor, topk: int):
75
+ logits = logits.float()
76
+ gates = F.softmax(logits, dim=1)
77
+ # expert_capacity = topk * gates.shape[0]
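+ # Per-expert buffer size: topk * num_tokens // num_experts slots, but never fewer than topk.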
78
+ expert_capacity = max(topk, topk * gates.shape[0] // gates.shape[1])
79
+ num_experts = int(gates.shape[1])
80
+ # Top-k router probability and corresponding expert indices for each token.
81
+ # Shape: [tokens_per_group, num_selected_experts].
82
+ expert_gate, expert_index = torch.topk(gates, topk)
83
+ expert_mask = F.one_hot(expert_index, num_experts)
84
+ # For a given token, determine if it was routed to a given expert.
85
+ # Shape: [tokens_per_group, num_experts]
86
+ expert_mask_aux = expert_mask.max(dim=-2)[0]
87
+ tokens_per_group_and_expert = torch.mean(expert_mask_aux.float(), dim=-2)
88
+ router_prob_per_group_and_expert = torch.mean(gates.float(), dim=-2)
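+ # Auxiliary load-balancing loss: correlates the fraction of tokens sent to each expert with its mean router probability, pushing the router toward a uniform spread over experts.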
89
+ l_aux = num_experts**2 * torch.mean(tokens_per_group_and_expert * router_prob_per_group_and_expert)
90
+
91
+ gates_s = torch.clamp(
92
+ torch.matmul(expert_mask.float(), gates.unsqueeze(-1)).sum(dim=1), min=torch.finfo(gates.dtype).eps
93
+ )
94
+ router_probs = gates / gates_s
95
+ # Make num_selected_experts the leading axis to ensure that top-1 choices
96
+ # have priority over top-2 choices, which have priority over top-3 choices,
97
+ # etc.
98
+ expert_index = torch.transpose(expert_index, 0, 1)
99
+ # Shape: [num_selected_experts * tokens_per_group]
100
+ expert_index = expert_index.reshape(-1)
101
+
102
+ # Create mask out of indices.
103
+ # Shape: [tokens_per_group * num_selected_experts, num_experts].
104
+ expert_mask = F.one_hot(expert_index, num_experts).to(torch.int32)
105
+ exp_counts = torch.sum(expert_mask, dim=0).detach()
106
+
107
+ # Experts have a fixed capacity that we cannot exceed. A token's priority
108
+ # within the expert's buffer is given by the masked, cumulative capacity of
109
+ # its target expert.
110
+ # Shape: [tokens_per_group * num_selected_experts, num_experts].
111
+ token_priority = torch.cumsum(expert_mask, dim=0) * expert_mask - 1
112
+ # Shape: [num_selected_experts, tokens_per_group, num_experts].
113
+ token_priority = token_priority.reshape((topk, -1, num_experts))
114
+ # Shape: [tokens_per_group, num_selected_experts, num_experts].
115
+ token_priority = torch.transpose(token_priority, 0, 1)
116
+ # For each token, across all selected experts, select the only non-negative
117
+ # (unmasked) priority. Now, for group G routing to expert E, token T has
118
+ # non-negative priority (i.e. token_priority[G,T,E] >= 0) if and only if E
119
+ # is its targeted expert.
120
+ # Shape: [tokens_per_group, num_experts].
121
+ token_priority = torch.max(token_priority, dim=1)[0]
122
+
123
+ # Token T can only be routed to expert E if its priority is positive and
124
+ # less than the expert capacity. One-hot matrix will ignore indices outside
125
+ # the range [0, expert_capacity).
126
+ # Shape: [tokens_per_group, num_experts, expert_capacity].
127
+ valid_mask = torch.logical_and(token_priority >= 0, token_priority < expert_capacity)
128
+ token_priority = torch.masked_fill(token_priority, ~valid_mask, 0)
129
+ dispatch_mask = F.one_hot(token_priority, expert_capacity).to(torch.bool)
130
+ valid_mask = valid_mask.unsqueeze(-1).expand(-1, -1, expert_capacity)
131
+ dispatch_mask = torch.masked_fill(dispatch_mask, ~valid_mask, 0)
132
+
133
+ # The combine array will be used for combining expert outputs, scaled by the
134
+ # router probabilities. Shape: [num_groups, tokens_per_group, num_experts,
135
+ # expert_capacity].
136
+ combine_weights = torch.einsum("...te,...tec->...tec", router_probs, dispatch_mask)
137
+ exp_counts_capacity = torch.sum(dispatch_mask)
138
+ exp_capacity_rate = exp_counts_capacity / (logits.shape[0]*topk)
139
+
140
+ return [l_aux, exp_capacity_rate], combine_weights, dispatch_mask, exp_counts
141
+
142
+
143
+ def top1gating(logits: Tensor, random_routing_dropped_token: bool = False):
144
+ """Implements Top1Gating on logits."""
145
+ # everything is in fp32 in this function
146
+ logits = logits.float()
147
+ gates = F.softmax(logits, dim=1)
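+ # capacity equals the total number of tokens, so the top-k selection below keeps every token; no capacity-based dropping happens in this top-1 gate.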
148
+ capacity = gates.shape[0]
149
+
150
+ # Create a mask for 1st's expert per token
151
+ # noisy gating
152
+ indices1_s = torch.argmax(gates, dim=1)
153
+ num_experts = int(gates.shape[1])
154
+ mask1 = F.one_hot(indices1_s, num_classes=num_experts)
155
+
156
+ # gating decisions
157
+ # exp_counts = torch.sum(mask1, dim=0).detach().to('cpu')
158
+ exp_counts = torch.sum(mask1, dim=0).detach()
159
+
160
+ # Compute l_aux
161
+ me = torch.mean(gates, dim=0)
162
+ ce = torch.mean(mask1.float(), dim=0)
163
+ l_aux = torch.sum(me * ce) * num_experts
164
+ mask1_rand = mask1
165
+
166
+ top_idx = torch.topk(mask1_rand, k=capacity, dim=0)[1]
167
+
168
+ new_mask1 = mask1 * torch.zeros_like(mask1).scatter_(0, top_idx, 1)
169
+ mask1 = new_mask1
170
+ mask1_bk = mask1
171
+ if random_routing_dropped_token:
172
+ not_full = capacity - new_mask1.sum(dim=0)
173
+ sorted_notfull, indices_notfull = torch.sort(not_full, descending=True)
174
+ sorted_notfull = sorted_notfull.to(torch.int64)
175
+ not_full_experts_ids = torch.repeat_interleave(indices_notfull, sorted_notfull)
176
+ shuffle_not_full_ids = torch.randperm(not_full_experts_ids.shape[0])
177
+ not_full_experts_ids = not_full_experts_ids[shuffle_not_full_ids]
178
+ indices1_s_after_drop = torch.argmax(new_mask1, dim=1)
179
+ # get drop idx
180
+ drop_mask = 1 - new_mask1.sum(dim=1)
181
+ drop_mask = drop_mask.bool()
182
+ drop_idx = drop_mask.nonzero().view(-1)
183
+ drop_num = drop_mask.sum().to(torch.int64)
184
+ indices1_s_after_drop.scatter_(0, drop_idx, not_full_experts_ids[:drop_num])
185
+ nodrop_mask1 = F.one_hot(indices1_s_after_drop, num_classes=num_experts)
186
+ mask1 = nodrop_mask1
187
+
188
+ # Compute locations in capacity buffer
189
+ locations1 = torch.cumsum(mask1, dim=0) - 1
190
+
191
+ # Store the capacity location for each token
192
+ locations1_s = torch.sum(locations1 * mask1, dim=1)
193
+
194
+ # Normalize gate probabilities
195
+ mask1_float = mask1.float()
196
+ gates = gates * mask1_float
197
+
198
+ locations1_sc = F.one_hot(locations1_s, num_classes=capacity).float() # one hot to float
199
+ combine_weights = torch.einsum("se,sc->sec", gates, locations1_sc)
200
+
201
+ dispatch_mask = combine_weights.bool()
202
+
203
+ exp_counts_capacity = torch.sum(mask1_bk)
204
+ exp_capacity_rate = exp_counts_capacity / (logits.shape[0])
205
+ return [l_aux, exp_capacity_rate], combine_weights, dispatch_mask, exp_counts
206
+
207
+
208
+ def _get_unpad_data(attention_mask):
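+ # Turn a [batch, seq] padding mask into the flattened token indices, cumulative sequence lengths, and max sequence length expected by flash_attn_varlen_func.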
209
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
210
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
211
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
212
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
213
+ return (
214
+ indices,
215
+ cu_seqlens,
216
+ max_seqlen_in_batch,
217
+ )
218
+
219
+
220
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
221
+ warnings.warn(
222
+ "Calling `transformers.models.llama.modeling_llama._prepare_4d_attention_mask` is deprecated and will be "
223
+ "removed in v4.37. Use `transformers.modeling_attn_mask_utils._prepare_4d_attention_mask"
224
+ )
225
+ return _prepare_4d_attention_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
226
+
227
+
228
+ def _make_causal_mask(
229
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
230
+ ):
231
+ warnings.warn(
232
+ "Calling `transformers.models.llama.modeling_llama._make_causal_mask` is deprecated and will be removed in "
233
+ "v4.37. Use `transformers.models.llama.modeling_llama.AttentionMaskConverter._make_causal_mask"
234
+ )
235
+ return AttentionMaskConverter._make_causal_mask(
236
+ input_ids_shape=input_ids_shape, dtype=dtype, device=device, past_key_values_length=past_key_values_length
237
+ )
238
+
239
+
240
+ class HunYuanRMSNorm(nn.Module):
241
+ def __init__(self, hidden_size, eps=1e-6):
242
+ """
243
+ HunYuanRMSNorm is equivalent to T5LayerNorm
244
+ """
245
+ super().__init__()
246
+ self.weight = nn.Parameter(torch.ones(hidden_size))
247
+ self.variance_epsilon = eps
248
+
249
+ def forward(self, hidden_states):
250
+ input_dtype = hidden_states.dtype
251
+ hidden_states = hidden_states.to(torch.float32)
252
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
253
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
254
+ return self.weight * hidden_states.to(input_dtype)
255
+
256
+
257
+ ALL_LAYERNORM_LAYERS.append(HunYuanRMSNorm)
258
+
259
+
260
+ class HunYuanRotaryEmbedding(nn.Module):
261
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
262
+ super().__init__()
263
+
264
+ self.dim = dim
265
+ self.max_position_embeddings = max_position_embeddings
266
+ self.base = base
267
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
268
+ # inv_freq = inv_freq.bfloat16()
269
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
270
+
271
+ # Build here to make `torch.jit.trace` work.
272
+ self._set_cos_sin_cache(
273
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
274
+ )
275
+
276
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
277
+ self.max_seq_len_cached = seq_len
278
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
279
+
280
+ self.inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
281
+ freqs = torch.outer(t, self.inv_freq)
282
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
283
+ emb = torch.cat((freqs, freqs), dim=-1).float()
284
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
285
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
286
+
287
+ def forward(self, x, seq_len=None):
288
+ # x: [bs, num_attention_heads, seq_len, head_size]
289
+ if seq_len > self.max_seq_len_cached or self.inv_freq.dtype != torch.float32:
290
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
291
+
292
+ return (
293
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
294
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
295
+ )
296
+
297
+
298
+ class HunYuanLinearScalingRotaryEmbedding(HunYuanRotaryEmbedding):
299
+ """HunYuanRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
300
+
301
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
302
+ self.scaling_factor = scaling_factor
303
+ super().__init__(dim, max_position_embeddings, base, device)
304
+
305
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
306
+ self.max_seq_len_cached = seq_len
307
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
308
+ t = t / self.scaling_factor
309
+
310
+ freqs = torch.outer(t, self.inv_freq)
311
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
312
+ emb = torch.cat((freqs, freqs), dim=-1)
313
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
314
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
315
+
316
+
317
+ class HunYuanDynamicNTKScalingRotaryEmbedding(HunYuanRotaryEmbedding):
318
+ """
319
+ HunYuanRotaryEmbedding extended with Dynamic NTK scaling.
320
+ Credits to the Reddit users /u/bloc97 and /u/emozilla
321
+ """
322
+
323
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
324
+ self.scaling_factor = scaling_factor
325
+ super().__init__(dim, max_position_embeddings, base, device)
326
+
327
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
328
+ self.max_seq_len_cached = seq_len
329
+
330
+ if seq_len > self.max_position_embeddings:
331
+ base = self.base * (
332
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
333
+ ) ** (self.dim / (self.dim - 2))
334
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
335
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
336
+
337
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
338
+
339
+ freqs = torch.outer(t, self.inv_freq)
340
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
341
+ emb = torch.cat((freqs, freqs), dim=-1)
342
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
343
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
344
+
345
+
346
+ class HunYuanDynamicNTKAlphaRotaryEmbedding(HunYuanRotaryEmbedding):
347
+ """
348
+ HunYuanRotaryEmbedding extended with NTK-alpha scaling, where the RoPE base is rescaled by a fixed alpha factor.
349
+ Credits to the Reddit users /u/bloc97 and /u/emozilla
350
+ """
351
+
352
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_alpha=1.0):
353
+ self.scaling_alpha = scaling_alpha
354
+ super().__init__(dim, max_position_embeddings, base, device)
355
+
356
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
357
+ self.max_seq_len_cached = seq_len
358
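+ # NTK-alpha scaling: enlarge the RoPE base by alpha ** (dim / (dim - 2)) so the rotary frequencies stretch to cover longer contexts.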
+ base = self.base * self.scaling_alpha ** (self.dim / (self.dim-2))
359
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
360
+
361
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
362
+
363
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
364
+
365
+ freqs = torch.outer(t, self.inv_freq)
366
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
367
+ emb = torch.cat((freqs, freqs), dim=-1)
368
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
369
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
370
+
371
+
372
+ def rotate_half(x):
373
+ """Rotates half the hidden dims of the input."""
374
+ x1 = x[..., : x.shape[-1] // 2]
375
+ x2 = x[..., x.shape[-1] // 2:]
376
+ return torch.cat((-x2, x1), dim=-1)
377
+
378
+
379
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
380
+ """Applies Rotary Position Embedding to the query and key tensors.
381
+
382
+ Args:
383
+ q (`torch.Tensor`): The query tensor.
384
+ k (`torch.Tensor`): The key tensor.
385
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
386
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
387
+ position_ids (`torch.Tensor`):
388
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
389
+ used to pass offsetted position ids when working with a KV-cache.
390
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
391
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
392
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
393
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
394
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
395
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
396
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
397
+ Returns:
398
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
399
+ """
400
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
401
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
402
+ q_embed = (q * cos) + (rotate_half(q) * sin)
403
+ k_embed = (k * cos) + (rotate_half(k) * sin)
404
+ return q_embed, k_embed
405
+
406
+
407
+ class HunYuanMLP(nn.Module):
408
+ def __init__(self, config: HunYuanConfig, layer_idx=None, is_shared_mlp=False):
409
+ super().__init__()
410
+ self.config = config
411
+ self.layer_idx = layer_idx
412
+ self.hidden_size = config.hidden_size
413
+ if is_shared_mlp:
414
+ self.intermediate_size = config.intermediate_size * config.num_shared_expert[0]
415
+ else:
416
+ self.intermediate_size = config.intermediate_size
417
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
418
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
419
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
420
+ self.act_fn = ACT2FN[config.hidden_act]
421
+
422
+ def forward(self, x):
423
+ if self.config.pretraining_tp > 1:
424
+ slice = self.intermediate_size // self.config.pretraining_tp
425
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
426
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
427
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
428
+
429
+ gate_proj = torch.cat(
430
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
431
+ )
432
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
433
+
434
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
435
+ down_proj = [
436
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
437
+ ]
438
+ down_proj = sum(down_proj)
439
+ else:
440
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
441
+
442
+ return down_proj
443
+
444
+
445
+ class HunYuanTopKGate(nn.Module):
446
+ def __init__(self, config: HunYuanConfig, layer_idx: Optional[int] = None):
447
+ super().__init__()
448
+ self.config = config
449
+ self.layer_idx = layer_idx
450
+ self.moe_topk = config.moe_topk
451
+ self.drop_tokens = config.moe_drop_tokens
452
+ self.min_capacity = 8
453
+ self.random_routing_dropped_token = config.moe_random_routing_dropped_token
454
+ self.wg = nn.Linear(config.hidden_size, config.num_experts, bias=False, dtype=torch.float32)
455
+
456
+ def forward(self, hidden_states):
457
+ bsz, seq_len, hidden_size = hidden_states.shape
458
+ hidden_states = hidden_states.reshape(-1, hidden_size)
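+ # The router weights are kept in float32; cast the token activations up to match so the routing logits and softmax stay numerically stable.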
459
+ if self.wg.weight.dtype == torch.float32:
460
+ hidden_states = hidden_states.float()
461
+ logits = self.wg(hidden_states)
462
+ if self.moe_topk == 1:
463
+ gate_output = top1gating(logits, random_routing_dropped_token=self.random_routing_dropped_token)
464
+ else:
465
+ gate_output = topkgating(logits, self.moe_topk[0])
466
+
467
+ return gate_output
468
+
469
+
470
+ class HunYuanMoE(nn.Module):
471
+ def __init__(self, config: HunYuanConfig, layer_idx: Optional[int] = None):
472
+ super().__init__()
473
+ self.config = config
474
+ self.layer_idx = layer_idx
475
+ self.moe_topk = config.moe_topk
476
+ self.num_experts = config.num_experts
477
+ if config.use_mixed_mlp_moe:
478
+ self.shared_mlp = HunYuanMLP(config, layer_idx=layer_idx, is_shared_mlp=True)
479
+ self.gate = HunYuanTopKGate(config, layer_idx=layer_idx)
480
+ self.experts = nn.ModuleList(
481
+ [HunYuanMLP(config, layer_idx=layer_idx, is_shared_mlp=False) for _ in range(config.num_experts)]
482
+ )
483
+
484
+ def forward(self, hidden_states):
485
+ bsz, seq_len, hidden_size = hidden_states.shape
486
+
487
+ if self.config.use_mixed_mlp_moe:
488
+ hidden_states_mlp = self.shared_mlp(hidden_states)
489
+
490
+ l_moe, combine_weights, dispatch_mask, exp_counts = self.gate(hidden_states)
491
+
492
+ reshaped_input = hidden_states.reshape(-1, hidden_size)
493
+
494
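+ # Dispatch: scatter each token s into its expert e / capacity slot c, giving one [capacity, hidden] buffer per expert.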
+ dispatched_input = torch.einsum("sec,sm->ecm", dispatch_mask.type_as(hidden_states), reshaped_input)
495
+
496
+ chunks = dispatched_input.chunk(self.num_experts, dim=0)
497
+ expert_outputs = []
498
+ for chunk, expert in zip(chunks, self.experts):
499
+ expert_outputs.append(expert(chunk))
500
+
501
+ expert_output = torch.cat(expert_outputs, dim=0)
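+ # Combine: gather the expert outputs back into token order, weighting every routed copy by its router probability.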
502
+ combined_output = torch.einsum("sec,ecm->sm", combine_weights.type_as(hidden_states), expert_output)
503
+ combined_output = combined_output.reshape(bsz, seq_len, hidden_size)
504
+
505
+ if self.config.use_mixed_mlp_moe:
506
+ output = hidden_states_mlp + combined_output
507
+ else:
508
+ output = combined_output
509
+
510
+ return output
511
+
512
+
513
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
514
+ """
515
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
516
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
517
+ """
518
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
519
+ if n_rep == 1:
520
+ return hidden_states
521
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
522
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
523
+
524
+
525
+ class HunYuanAttention(nn.Module):
526
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
527
+
528
+ def __init__(self, config: HunYuanConfig, layer_idx: Optional[int] = None):
529
+ super().__init__()
530
+ self.config = config
531
+ self.layer_idx = layer_idx
532
+ # layer_idx starts from 0
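+ # With cross-layer attention (CLA) enabled, only every cla_share_factor-th layer computes its own key/value projections; the layers in between reuse the key/value states passed in via kv_states.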
533
+ self.attention_type = 'cross' if config.use_cla and layer_idx % config.cla_share_factor != 0 else 'self'
534
+ if layer_idx is None:
535
+ logger.warning_once(
536
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
537
+ "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
538
+ "when creating this class."
539
+ )
540
+
541
+ self.attention_dropout = config.attention_dropout
542
+ self.hidden_size = config.hidden_size
543
+ self.num_heads = config.num_attention_heads
544
+ self.head_dim = self.hidden_size // self.num_heads
545
+ self.num_key_value_heads = config.num_key_value_heads
546
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
547
+ self.max_position_embeddings = config.max_position_embeddings
548
+ self.rope_theta = config.rope_theta
549
+ self.is_causal = True
550
+ self.use_qk_norm = config.use_qk_norm
551
+
552
+ if (self.head_dim * self.num_heads) != self.hidden_size:
553
+ raise ValueError(
554
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
555
+ f" and `num_heads`: {self.num_heads})."
556
+ )
557
+
558
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
559
+ if self.attention_type == 'self':
560
+ self.k_proj = nn.Linear(
561
+ self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias
562
+ )
563
+ self.v_proj = nn.Linear(
564
+ self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias
565
+ )
566
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
567
+ if self.use_qk_norm:
568
+ self.query_layernorm = HunYuanRMSNorm(self.head_dim, eps=config.rms_norm_eps)
569
+ self.key_layernorm = HunYuanRMSNorm(self.head_dim, eps=config.rms_norm_eps)
570
+ self._init_rope()
571
+
572
+ def _init_rope(self):
573
+ if self.config.rope_scaling is None:
574
+ self.rotary_emb = HunYuanRotaryEmbedding(
575
+ self.head_dim,
576
+ max_position_embeddings=self.max_position_embeddings,
577
+ base=self.rope_theta,
578
+ )
579
+ else:
580
+ scaling_type = self.config.rope_scaling["type"]
581
+ scaling_factor = self.config.rope_scaling["factor"]
582
+ scaling_alpha = self.config.rope_scaling["alpha"]
583
+ if scaling_type == "linear":
584
+ self.rotary_emb = HunYuanLinearScalingRotaryEmbedding(
585
+ self.head_dim,
586
+ max_position_embeddings=self.max_position_embeddings,
587
+ scaling_factor=scaling_factor,
588
+ base=self.rope_theta,
589
+ )
590
+ elif scaling_type == "dynamic":
591
+ if scaling_alpha:
592
+ self.rotary_emb = HunYuanDynamicNTKAlphaRotaryEmbedding(
593
+ self.head_dim,
594
+ max_position_embeddings=self.max_position_embeddings,
595
+ scaling_alpha=scaling_alpha,
596
+ base=self.rope_theta,
597
+ )
598
+ else:
599
+ self.rotary_emb = HunYuanDynamicNTKScalingRotaryEmbedding(
600
+ self.head_dim,
601
+ max_position_embeddings=self.max_position_embeddings,
602
+ scaling_factor=scaling_factor,
603
+ base=self.rope_theta,
604
+ )
605
+ else:
606
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
607
+
608
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
609
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
610
+
611
+ def forward(
612
+ self,
613
+ hidden_states: torch.Tensor,
614
+ attention_mask: Optional[torch.Tensor] = None,
615
+ position_ids: Optional[torch.LongTensor] = None,
616
+ past_key_value: Optional[Cache] = None,
617
+ output_attentions: bool = False,
618
+ use_cache: bool = False,
619
+ kv_states: torch.Tensor = None,
620
+ **kwargs,
621
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
622
+ if "padding_mask" in kwargs:
623
+ warnings.warn(
624
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use "
625
+ "`attention_mask` instead.`"
626
+ )
627
+
628
+ bsz, q_len, _ = hidden_states.size()
629
+
630
+ if self.config.pretraining_tp > 1:
631
+ query_slices = self.q_proj.weight.split(
632
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
633
+ )
634
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
635
+ query_states = torch.cat(query_states, dim=-1)
636
+
637
+ if self.attention_type == "cross" and kv_states is not None and isinstance(kv_states, tuple):
638
+ orig_key_states, orig_value_states = kv_states
639
+ key_states, value_states = kv_states
640
+ else:
641
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
642
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
643
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
644
+
645
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
646
+ key_states = torch.cat(key_states, dim=-1)
647
+
648
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
649
+ value_states = torch.cat(value_states, dim=-1)
650
+ orig_key_states, orig_value_states = key_states, value_states
651
+
652
+ else:
653
+ query_states = self.q_proj(hidden_states)
654
+ if self.attention_type == "cross" and kv_states is not None and isinstance(kv_states, tuple):
655
+ orig_key_states, orig_value_states = kv_states
656
+ key_states, value_states = kv_states
657
+ else:
658
+ key_states = self.k_proj(hidden_states)
659
+ value_states = self.v_proj(hidden_states)
660
+ orig_key_states, orig_value_states = key_states, value_states
661
+
662
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
663
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
664
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
665
+
666
+ kv_seq_len = key_states.shape[-2]
667
+ if past_key_value is not None:
668
+ if self.layer_idx is None:
669
+ raise ValueError(
670
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
671
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
672
+ "with a layer index."
673
+ )
674
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
675
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
676
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
677
+
678
+ if self.use_qk_norm:
679
+ query_states = self.query_layernorm(query_states)
680
+ key_states = self.key_layernorm(key_states)
681
+
682
+ if past_key_value is not None:
683
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
684
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
685
+
686
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
687
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
688
+
689
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
690
+
691
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
692
+ raise ValueError(
693
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
694
+ f" {attn_weights.size()}"
695
+ )
696
+
697
+ if attention_mask is not None:
698
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
699
+ raise ValueError(
700
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
701
+ )
702
+ attn_weights = attn_weights + attention_mask
703
+
704
+ # upcast attention to fp32
705
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
706
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
707
+ attn_output = torch.matmul(attn_weights, value_states)
708
+
709
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
710
+ raise ValueError(
711
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
712
+ f" {attn_output.size()}"
713
+ )
714
+
715
+ attn_output = attn_output.transpose(1, 2).contiguous()
716
+
717
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
718
+
719
+ if self.config.pretraining_tp > 1:
720
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
721
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
722
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
723
+ else:
724
+ attn_output = self.o_proj(attn_output)
725
+
726
+ if not output_attentions:
727
+ attn_weights = None
728
+
729
+ return attn_output, attn_weights, past_key_value, (orig_key_states, orig_value_states)
730
+
731
+
732
+ class HunYuanFlashAttention2(HunYuanAttention):
733
+ """
734
+ HunYuan flash attention module. This module inherits from `HunYuanAttention`, as the weights of the module stay
735
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
736
+ flash attention and deal with padding tokens in case the input contains any of them.
737
+ """
738
+
739
+ def __init__(self, *args, **kwargs):
740
+ super().__init__(*args, **kwargs)
741
+
742
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
743
+
744
+ def forward(
745
+ self,
746
+ hidden_states: torch.Tensor,
747
+ attention_mask: Optional[torch.LongTensor] = None,
748
+ position_ids: Optional[torch.LongTensor] = None,
749
+ past_key_value: Optional[Cache] = None,
750
+ output_attentions: bool = False,
751
+ use_cache: bool = False,
752
+ kv_states: torch.Tensor = None,
753
+ **kwargs,
754
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
755
+ # HunYuanFlashAttention2 attention does not support output_attentions
756
+ if "padding_mask" in kwargs:
757
+ warnings.warn(
758
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use "
759
+ "`attention_mask` instead.`"
760
+ )
761
+
762
+ # overwrite attention_mask with padding_mask
763
+ attention_mask = kwargs.pop("padding_mask")
764
+
765
+ bsz, q_len, _ = hidden_states.size()
766
+
767
+ query_states = self.q_proj(hidden_states)
768
+ if self.attention_type == "cross" and kv_states is not None and isinstance(kv_states, tuple):
769
+ orig_key_states, orig_value_states = kv_states
770
+ key_states, value_states = kv_states
771
+ else:
772
+ key_states = self.k_proj(hidden_states)
773
+ value_states = self.v_proj(hidden_states)
774
+ orig_key_states, orig_value_states = key_states, value_states
775
+
776
+ # Flash attention requires the input to have the shape
777
+ # batch_size x seq_length x num_heads x head_dim
778
+ # therefore we just need to keep the original shape
779
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
780
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
781
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
782
+
783
+ kv_seq_len = key_states.shape[-2]
784
+ if past_key_value is not None:
785
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
786
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
787
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
788
+
789
+ if self.use_qk_norm:
790
+ query_states = self.query_layernorm(query_states)
791
+ key_states = self.key_layernorm(key_states)
792
+
793
+ if past_key_value is not None:
794
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
795
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
796
+
797
+ query_states = query_states.transpose(1, 2)
798
+ key_states = key_states.transpose(1, 2)
799
+ value_states = value_states.transpose(1, 2)
800
+
801
+ dropout_rate = self.attention_dropout if self.training else 0.0
802
+
803
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
804
+ # therefore the input hidden states gets silently casted in float32. Hence, we need
805
+ # cast them back in the correct dtype just to be sure everything works as expected.
806
+ # This might slow down training & inference, so it is recommended not to cast the LayerNorms
807
+ # in fp32. (HunYuanRMSNorm handles it correctly)
808
+
809
+ input_dtype = query_states.dtype
810
+ if input_dtype == torch.float32:
811
+ # Handle the case where the model is quantized
812
+ if hasattr(self.config, "_pre_quantization_dtype"):
813
+ target_dtype = self.config._pre_quantization_dtype
814
+ else:
815
+ target_dtype = self.q_proj.weight.dtype
816
+
817
+ logger.warning_once(
818
+ f"The input hidden states seems to be silently casted in float32, this might be related to"
819
+ f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
820
+ f" {target_dtype}."
821
+ )
822
+
823
+ query_states = query_states.to(target_dtype)
824
+ key_states = key_states.to(target_dtype)
825
+ value_states = value_states.to(target_dtype)
826
+
827
+ attn_output = self._flash_attention_forward(
828
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
829
+ )
830
+
831
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
832
+ attn_output = self.o_proj(attn_output)
833
+
834
+ return attn_output, None, past_key_value, (orig_key_states, orig_value_states)
835
+
836
+ def _flash_attention_forward(
837
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
838
+ ):
839
+ """
840
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
841
+ first unpad the input, then computes the attention scores and pad the final attention scores.
842
+
843
+ Args:
844
+ query_states (`torch.Tensor`):
845
+ Input query states to be passed to Flash Attention API
846
+ key_states (`torch.Tensor`):
847
+ Input key states to be passed to Flash Attention API
848
+ value_states (`torch.Tensor`):
849
+ Input value states to be passed to Flash Attention API
850
+ attention_mask (`torch.Tensor`):
851
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
852
+ position of padding tokens and 1 for the position of non-padding tokens.
853
+ dropout (`float`, *optional*):
854
+ Attention dropout
855
+ softmax_scale (`float`, *optional*):
856
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
857
+ """
858
+ if not self._flash_attn_uses_top_left_mask:
859
+ causal = self.is_causal
860
+ else:
861
+ causal = self.is_causal and query_length != 1
862
+
863
+ # Contains at least one padding token in the sequence
864
+ if attention_mask is not None:
865
+ batch_size = query_states.shape[0]
866
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
867
+ query_states, key_states, value_states, attention_mask, query_length
868
+ )
869
+
870
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
871
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
872
+
873
+ attn_output_unpad = flash_attn_varlen_func(
874
+ query_states,
875
+ key_states,
876
+ value_states,
877
+ cu_seqlens_q=cu_seqlens_q,
878
+ cu_seqlens_k=cu_seqlens_k,
879
+ max_seqlen_q=max_seqlen_in_batch_q,
880
+ max_seqlen_k=max_seqlen_in_batch_k,
881
+ dropout_p=dropout,
882
+ softmax_scale=softmax_scale,
883
+ causal=causal,
884
+ )
885
+
886
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
887
+ else:
888
+ attn_output = flash_attn_func(
889
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
890
+ )
891
+
892
+ return attn_output
893
+
894
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
895
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
896
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
897
+
898
+ key_layer = index_first_axis(
899
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
900
+ )
901
+ value_layer = index_first_axis(
902
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
903
+ )
904
+ if query_length == kv_seq_len:
905
+ query_layer = index_first_axis(
906
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
907
+ )
908
+ cu_seqlens_q = cu_seqlens_k
909
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
910
+ indices_q = indices_k
911
+ elif query_length == 1:
912
+ max_seqlen_in_batch_q = 1
913
+ cu_seqlens_q = torch.arange(
914
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
915
+ ) # There is a memcpy here, that is very bad.
916
+ indices_q = cu_seqlens_q[:-1]
917
+ query_layer = query_layer.squeeze(1)
918
+ else:
919
+ # The -q_len: slice assumes left padding.
920
+ attention_mask = attention_mask[:, -query_length:]
921
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
922
+
923
+ return (
924
+ query_layer,
925
+ key_layer,
926
+ value_layer,
927
+ indices_q,
928
+ (cu_seqlens_q, cu_seqlens_k),
929
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
930
+ )
931
+
932
+
933
+ class HunYuanSdpaAttention(HunYuanAttention):
934
+ """
935
+ HunYuan attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
936
+ `HunYuanAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt
937
+ to SDPA API.
938
+ """
939
+
940
+ # Adapted from HunYuanAttention.forward
941
+ def forward(
942
+ self,
943
+ hidden_states: torch.Tensor,
944
+ attention_mask: Optional[torch.Tensor] = None,
945
+ position_ids: Optional[torch.LongTensor] = None,
946
+ past_key_value: Optional[Cache] = None,
947
+ output_attentions: bool = False,
948
+ use_cache: bool = False,
949
+ kv_states: torch.Tensor = None,
950
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
951
+ if output_attentions:
952
+ logger.warning_once(
953
+ 'HunYuanModel is using HunYuanSdpaAttention, '
954
+ 'but `torch.nn.functional.scaled_dot_product_attention` '
955
+ 'does not support `output_attentions=True`. Falling back to the manual attention implementation, '
956
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. '
957
+ 'This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
958
+ )
959
+ return super().forward(
960
+ hidden_states=hidden_states,
961
+ attention_mask=attention_mask,
962
+ position_ids=position_ids,
963
+ past_key_value=past_key_value,
964
+ output_attentions=output_attentions,
965
+ use_cache=use_cache,
966
+ kv_states=kv_states,
+ )
967
+
968
+ bsz, q_len, _ = hidden_states.size()
969
+
970
+ query_states = self.q_proj(hidden_states)
971
+ if self.attention_type == "cross" and kv_states is not None and isinstance(kv_states, tuple):
972
+ orig_key_states, orig_value_states = kv_states
973
+ key_states, value_states = kv_states
974
+ else:
975
+ key_states = self.k_proj(hidden_states)
976
+ value_states = self.v_proj(hidden_states)
977
+ orig_key_states, orig_value_states = key_states, value_states
978
+
979
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
980
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
981
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
982
+
983
+ kv_seq_len = key_states.shape[-2]
984
+ if past_key_value is not None:
985
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
986
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
987
+
988
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
989
+
990
+ if self.use_qk_norm:
991
+ query_states = self.query_layernorm(query_states)
992
+ key_states = self.key_layernorm(key_states)
993
+
994
+ if past_key_value is not None:
995
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
996
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
997
+
998
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
999
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
1000
+
1001
+ if attention_mask is not None:
1002
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
1003
+ raise ValueError(
1004
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
1005
+ )
1006
+
1007
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with
1008
+ # custom attn_mask,
1009
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
1010
+ if query_states.device.type == "cuda" and attention_mask is not None:
1011
+ query_states = query_states.contiguous()
1012
+ key_states = key_states.contiguous()
1013
+ value_states = value_states.contiguous()
1014
+
1015
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
1016
+ query_states,
1017
+ key_states,
1018
+ value_states,
1019
+ attn_mask=attention_mask,
1020
+ dropout_p=self.attention_dropout if self.training else 0.0,
1021
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a
1022
+ # causal mask in case q_len == 1.
1023
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
1024
+ )
1025
+
1026
+ attn_output = attn_output.transpose(1, 2).contiguous()
1027
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
1028
+
1029
+ attn_output = self.o_proj(attn_output)
1030
+
1031
+ return attn_output, None, past_key_value, (orig_key_states, orig_value_states)
1032
+
1033
+
1034
+ HUNYUAN_ATTENTION_CLASSES = {
1035
+ "eager": HunYuanAttention,
1036
+ "flash_attention_2": HunYuanFlashAttention2,
1037
+ "sdpa": HunYuanSdpaAttention,
1038
+ }
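
The dictionary above is what each decoder layer indexes with `config._attn_implementation` to pick its attention backend. As a minimal sketch (not part of this checkpoint), the backend can be selected when loading through the auto classes; the model path below is a placeholder, and "flash_attention_2" additionally requires the flash-attn package to be installed:

# Sketch only: select one of the registered attention backends at load time.
# The model id below is a placeholder for wherever this checkpoint lives.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/Hunyuan-A13B",        # placeholder path
    attn_implementation="sdpa",    # "eager", "sdpa", or "flash_attention_2"
    trust_remote_code=True,
)
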
1039
+
1040
+
1041
+ class HunYuanDecoderLayer(nn.Module):
1042
+ def __init__(self, config: HunYuanConfig, layer_idx: int):
1043
+ super().__init__()
1044
+ self.hidden_size = config.hidden_size
1045
+ self.layer_idx = layer_idx
1046
+
1047
+ self.self_attn = HUNYUAN_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
1048
+
1049
+ if config.num_experts > 1:
1050
+ self.mlp = HunYuanMoE(config, layer_idx=layer_idx)
1051
+ else:
1052
+ self.mlp = HunYuanMLP(config, layer_idx=layer_idx, is_shared_mlp=False)
1053
+ self.input_layernorm = HunYuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1054
+ self.post_attention_layernorm = HunYuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1055
+
1056
+ def forward(
1057
+ self,
1058
+ hidden_states: torch.Tensor,
1059
+ attention_mask: Optional[torch.Tensor] = None,
1060
+ position_ids: Optional[torch.LongTensor] = None,
1061
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
1062
+ output_attentions: Optional[bool] = False,
1063
+ use_cache: Optional[bool] = False,
1064
+ kv_states: Optional[Tuple[torch.Tensor]] = None,
1065
+ **kwargs,
1066
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
1067
+ """
1068
+ Args:
1069
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
1070
+ attention_mask (`torch.FloatTensor`, *optional*):
1071
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
1072
+ query_sequence_length, key_sequence_length)` if default attention is used.
1073
+ output_attentions (`bool`, *optional*):
1074
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
1075
+ returned tensors for more detail.
1076
+ use_cache (`bool`, *optional*):
1077
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
1078
+ (see `past_key_values`).
1079
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
1080
+ kv_states (`Tuple(torch.FloatTensor)`, *optional*): Used when CLA is enabled,
1081
+ key and value states from past attention blocks
1082
+ """
1083
+ if "padding_mask" in kwargs:
1084
+ warnings.warn(
1085
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use "
1086
+ "`attention_mask` instead.`"
1087
+ )
1088
+
1089
+ residual = hidden_states
1090
+
1091
+ hidden_states = self.input_layernorm(hidden_states)
1092
+
1093
+ # Self Attention
1094
+ hidden_states, self_attn_weights, present_key_value, kv_states = self.self_attn(
1095
+ hidden_states=hidden_states,
1096
+ attention_mask=attention_mask,
1097
+ position_ids=position_ids,
1098
+ past_key_value=past_key_value,
1099
+ output_attentions=output_attentions,
1100
+ use_cache=use_cache,
1101
+ kv_states=kv_states,
1102
+ **kwargs,
1103
+ )
1104
+ hidden_states = residual + hidden_states
1105
+
1106
+ # Fully Connected
1107
+ residual = hidden_states
1108
+ hidden_states = self.post_attention_layernorm(hidden_states)
1109
+ hidden_states = self.mlp(hidden_states)
1110
+ hidden_states = residual + hidden_states
1111
+
1112
+ outputs = (hidden_states,)
1113
+
1114
+ if output_attentions:
1115
+ outputs += (self_attn_weights,)
1116
+
1117
+ if use_cache:
1118
+ outputs += (present_key_value,)
1119
+
1120
+ outputs += (kv_states,)
1121
+
1122
+ return outputs
1123
+
1124
+
1125
+ HUNYUAN_START_DOCSTRING = r"""
1126
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
1127
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
1128
+ etc.)
1129
+
1130
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
1131
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
1132
+ and behavior.
1133
+
1134
+ Parameters:
1135
+ config ([`HunYuanConfig`]):
1136
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
1137
+ load the weights associated with the model, only the configuration. Check out the
1138
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1139
+ """
1140
+
1141
+
1142
+ @add_start_docstrings(
1143
+ "The bare HunYuan Model outputting raw hidden-states without any specific head on top.",
1144
+ HUNYUAN_START_DOCSTRING,
1145
+ )
1146
+ class HunYuanPreTrainedModel(PreTrainedModel):
1147
+ config_class = HunYuanConfig
1148
+ base_model_prefix = "model"
1149
+ supports_gradient_checkpointing = True
1150
+ _no_split_modules = ["HunYuanDecoderLayer"]
1151
+ _skip_keys_device_placement = "past_key_values"
1152
+ _supports_flash_attn_2 = True
1153
+ _supports_sdpa = True
1154
+ _supports_cache_class = True
1155
+
1156
+ def _init_weights(self, module):
1157
+ std = self.config.initializer_range
1158
+ if isinstance(module, nn.Linear):
1159
+ module.weight.data.normal_(mean=0.0, std=std)
1160
+ if module.bias is not None:
1161
+ module.bias.data.zero_()
1162
+ elif isinstance(module, nn.Embedding):
1163
+ module.weight.data.normal_(mean=0.0, std=std)
1164
+ if module.padding_idx is not None:
1165
+ module.weight.data[module.padding_idx].zero_()
1166
+
1167
+
1168
+ HUNYUAN_INPUTS_DOCSTRING = r"""
1169
+ Args:
1170
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1171
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1172
+ it.
1173
+
1174
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1175
+ [`PreTrainedTokenizer.__call__`] for details.
1176
+
1177
+ [What are input IDs?](../glossary#input-ids)
1178
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1179
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1180
+
1181
+ - 1 for tokens that are **not masked**,
1182
+ - 0 for tokens that are **masked**.
1183
+
1184
+ [What are attention masks?](../glossary#attention-mask)
1185
+
1186
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1187
+ [`PreTrainedTokenizer.__call__`] for details.
1188
+
1189
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
1190
+ `past_key_values`).
1191
+
1192
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
1193
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
1194
+ information on the default strategy.
1195
+
1196
+ - 1 indicates the head is **not masked**,
1197
+ - 0 indicates the head is **masked**.
1198
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1199
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
1200
+ config.n_positions - 1]`.
1201
+
1202
+ [What are position IDs?](../glossary#position-ids)
1203
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
1204
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
1205
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
1206
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
1207
+
1208
+ Two formats are allowed:
1209
+ - a [`~cache_utils.Cache`] instance;
1210
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
1211
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
1212
+ cache format.
1213
+
1214
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
1215
+ legacy cache format will be returned.
1216
+
1217
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
1218
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
1219
+ of shape `(batch_size, sequence_length)`.
1220
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1221
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1222
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1223
+ model's internal embedding lookup matrix.
1224
+ use_cache (`bool`, *optional*):
1225
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1226
+ `past_key_values`).
1227
+ output_attentions (`bool`, *optional*):
1228
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1229
+ tensors for more detail.
1230
+ output_hidden_states (`bool`, *optional*):
1231
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1232
+ more detail.
1233
+ return_dict (`bool`, *optional*):
1234
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1235
+ """
1236
+
1237
+
1238
+ @add_start_docstrings(
1239
+ "The bare HunYuan Model outputting raw hidden-states without any specific head on top.",
1240
+ HUNYUAN_START_DOCSTRING,
1241
+ )
1242
+ class HunYuanModel(HunYuanPreTrainedModel):
1243
+ """
1244
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`HunYuanDecoderLayer`]
1245
+
1246
+ Args:
1247
+ config: HunYuanConfig
1248
+ """
1249
+
1250
+ def __init__(self, config: HunYuanConfig):
1251
+ super().__init__(config)
1252
+ self.padding_idx = config.pad_token_id
1253
+ self.vocab_size = config.vocab_size
1254
+
1255
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1256
+ self.layers = nn.ModuleList(
1257
+ [HunYuanDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1258
+ )
1259
+ self._use_sdpa = config._attn_implementation == "sdpa"
1260
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
1261
+ self.norm = HunYuanRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1262
+
1263
+ self.cla = config.use_cla
1264
+ self.cla_share_factor = config.cla_share_factor
1265
+
1266
+ self.gradient_checkpointing = False
1267
+ # Initialize weights and apply final processing
1268
+ self.post_init()
1269
+
1270
+ def get_input_embeddings(self):
1271
+ return self.embed_tokens
1272
+
1273
+ def set_input_embeddings(self, value):
1274
+ self.embed_tokens = value
1275
+
1276
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
1277
+ def forward(
1278
+ self,
1279
+ input_ids: torch.LongTensor = None,
1280
+ attention_mask: Optional[torch.Tensor] = None,
1281
+ position_ids: Optional[torch.LongTensor] = None,
1282
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1283
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1284
+ use_cache: Optional[bool] = None,
1285
+ output_attentions: Optional[bool] = None,
1286
+ output_hidden_states: Optional[bool] = None,
1287
+ return_dict: Optional[bool] = None,
1288
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
1289
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1290
+ output_hidden_states = (
1291
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1292
+ )
1293
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1294
+
1295
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1296
+
1297
+ # retrieve input_ids and inputs_embeds
1298
+ if input_ids is not None and inputs_embeds is not None:
1299
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
1300
+ elif input_ids is not None:
1301
+ batch_size, seq_length = input_ids.shape[:2]
1302
+ elif inputs_embeds is not None:
1303
+ batch_size, seq_length = inputs_embeds.shape[:2]
1304
+ else:
1305
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
1306
+
1307
+ if self.gradient_checkpointing and self.training:
1308
+ if use_cache:
1309
+ logger.warning_once(
1310
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1311
+ )
1312
+ use_cache = False
1313
+
1314
+ past_key_values_length = 0
1315
+ if use_cache:
1316
+ use_legacy_cache = not isinstance(past_key_values, Cache)
1317
+ if use_legacy_cache:
1318
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
1319
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
1320
+
1321
+ if position_ids is None:
1322
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1323
+ position_ids = torch.arange(
1324
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1325
+ )
1326
+ position_ids = position_ids.unsqueeze(0)
1327
+
1328
+ if inputs_embeds is None:
1329
+ inputs_embeds = self.embed_tokens(input_ids)
1330
+
1331
+ # Fix LoRA training when gradient checkpointing is enabled
1332
+ if self.training and inputs_embeds.is_leaf:
1333
+ inputs_embeds.requires_grad = True
1334
+
1335
+ if self._use_flash_attention_2:
1336
+ # 2d mask is passed through the layers
1337
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1338
+ elif self._use_sdpa and not output_attentions:
1339
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
1340
+ # the manual implementation that requires a 4D causal mask in all cases.
1341
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
1342
+ attention_mask,
1343
+ (batch_size, seq_length),
1344
+ inputs_embeds,
1345
+ past_key_values_length,
1346
+ )
1347
+ else:
1348
+ # 4d mask is passed through the layers
1349
+ attention_mask = _prepare_4d_causal_attention_mask(
1350
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
1351
+ )
1352
+
1353
+ # embed positions
1354
+ hidden_states = inputs_embeds
1355
+
1356
+ # decoder layers
1357
+ all_hidden_states = () if output_hidden_states else None
1358
+ all_self_attns = () if output_attentions else None
1359
+ next_decoder_cache = None
1360
+
1361
+ prev_kv_states = None
1362
+ for layer_idx, decoder_layer in enumerate(self.layers):
1363
+ if output_hidden_states:
1364
+ all_hidden_states += (hidden_states,)
1365
+
1366
+ if self.gradient_checkpointing and self.training:
1367
+ layer_outputs = self._gradient_checkpointing_func(
1368
+ decoder_layer.__call__,
1369
+ hidden_states,
1370
+ attention_mask,
1371
+ position_ids,
1372
+ past_key_values,
1373
+ output_attentions,
1374
+ use_cache,
1375
+ prev_kv_states,
1376
+ )
1377
+ else:
1378
+ layer_outputs = decoder_layer(
1379
+ hidden_states,
1380
+ attention_mask=attention_mask,
1381
+ position_ids=position_ids,
1382
+ past_key_value=past_key_values,
1383
+ output_attentions=output_attentions,
1384
+ use_cache=use_cache,
1385
+ kv_states=prev_kv_states
1386
+ )
1387
+
1388
+ hidden_states = layer_outputs[0]
1389
+
1390
+ if use_cache:
1391
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1392
+
1393
+ if output_attentions:
1394
+ all_self_attns += (layer_outputs[1],)
1395
+
1396
+ kv_states = layer_outputs[-1]
1397
+
1398
+ if self.cla and layer_idx % self.cla_share_factor == 0:
1399
+ prev_kv_states = kv_states
1400
+
1401
+ hidden_states = self.norm(hidden_states)
1402
+
1403
+ # add hidden states from the last decoder layer
1404
+ if output_hidden_states:
1405
+ all_hidden_states += (hidden_states,)
1406
+
1407
+ next_cache = None
1408
+ if use_cache:
1409
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1410
+ if not return_dict:
1411
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1412
+ return BaseModelOutputWithPast(
1413
+ last_hidden_state=hidden_states,
1414
+ past_key_values=next_cache,
1415
+ hidden_states=all_hidden_states,
1416
+ attentions=all_self_attns,
1417
+ )
1418
+
1419
+
1420
+ class HunYuanMoEV1ForCausalLM(HunYuanPreTrainedModel):
1421
+ _tied_weights_keys = ["lm_head.weight"]
1422
+
1423
+ def __init__(self, config: HunYuanConfig):
1424
+ super().__init__(config)
1425
+ self.model = HunYuanModel(config)
1426
+ self.vocab_size = config.vocab_size
1427
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1428
+
1429
+ # Initialize weights and apply final processing
1430
+ self.post_init()
1431
+
1432
+ def get_input_embeddings(self):
1433
+ return self.model.embed_tokens
1434
+
1435
+ def set_input_embeddings(self, value):
1436
+ self.model.embed_tokens = value
1437
+
1438
+ def get_output_embeddings(self):
1439
+ return self.lm_head
1440
+
1441
+ def set_output_embeddings(self, new_embeddings):
1442
+ self.lm_head = new_embeddings
1443
+
1444
+ def set_decoder(self, decoder):
1445
+ self.model = decoder
1446
+
1447
+ def get_decoder(self):
1448
+ return self.model
1449
+
1450
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
1451
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1452
+ def forward(
1453
+ self,
1454
+ input_ids: torch.LongTensor = None,
1455
+ attention_mask: Optional[torch.Tensor] = None,
1456
+ position_ids: Optional[torch.LongTensor] = None,
1457
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1458
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1459
+ labels: Optional[torch.LongTensor] = None,
1460
+ use_cache: Optional[bool] = None,
1461
+ output_attentions: Optional[bool] = None,
1462
+ output_hidden_states: Optional[bool] = None,
1463
+ return_dict: Optional[bool] = None,
1464
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1465
+ r"""
1466
+ Args:
1467
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1468
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1469
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1470
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1471
+
1472
+ Returns:
1473
+
1474
+ Example:
1475
+
1476
+ ```python
1477
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
1478
+
1479
+ >>> model = AutoModelForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1480
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1481
+
1482
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1483
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1484
+
1485
+ >>> # Generate
1486
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1487
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1488
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1489
+ ```"""
1490
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1491
+ output_hidden_states = (
1492
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1493
+ )
1494
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1495
+
1496
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1497
+ outputs = self.model(
1498
+ input_ids=input_ids,
1499
+ attention_mask=attention_mask,
1500
+ position_ids=position_ids,
1501
+ past_key_values=past_key_values,
1502
+ inputs_embeds=inputs_embeds,
1503
+ use_cache=use_cache,
1504
+ output_attentions=output_attentions,
1505
+ output_hidden_states=output_hidden_states,
1506
+ return_dict=return_dict,
1507
+ )
1508
+
1509
+ hidden_states = outputs[0]
1510
+ if self.config.pretraining_tp > 1:
1511
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1512
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1513
+ logits = torch.cat(logits, dim=-1)
1514
+ else:
1515
+ logits = self.lm_head(hidden_states)
1516
+ logits = logits.float()
1517
+
1518
+ loss = None
1519
+ if labels is not None:
1520
+ # Shift so that tokens < n predict n
1521
+ shift_logits = logits[..., :-1, :].contiguous()
1522
+ shift_labels = labels[..., 1:].contiguous()
1523
+ # Flatten the tokens
1524
+ loss_fct = CrossEntropyLoss()
1525
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1526
+ shift_labels = shift_labels.view(-1)
1527
+ # Enable model parallelism
1528
+ shift_labels = shift_labels.to(shift_logits.device)
1529
+ loss = loss_fct(shift_logits, shift_labels)
1530
+
1531
+ if not return_dict:
1532
+ output = (logits,) + outputs[1:]
1533
+ return (loss,) + output if loss is not None else output
1534
+
1535
+ return CausalLMOutputWithPast(
1536
+ loss=loss,
1537
+ logits=logits,
1538
+ past_key_values=outputs.past_key_values,
1539
+ hidden_states=outputs.hidden_states,
1540
+ attentions=outputs.attentions,
1541
+ )
1542
+
1543
+ def prepare_inputs_for_generation(
1544
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1545
+ ):
1546
+ if past_key_values is not None:
1547
+ if isinstance(past_key_values, Cache):
1548
+ cache_length = past_key_values.get_seq_length()
1549
+ past_length = past_key_values.seen_tokens
1550
+ max_cache_length = past_key_values.get_max_cache_shape()
1551
+ else:
1552
+ cache_length = past_length = past_key_values[0][0].shape[2]
1553
+ max_cache_length = None
1554
+
1555
+ # Keep only the unprocessed tokens:
1556
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1557
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
1558
+ # input)
1559
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1560
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
1561
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1562
+ # input_ids based on the past_length.
1563
+ elif past_length < input_ids.shape[1]:
1564
+ input_ids = input_ids[:, past_length:]
1565
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1566
+
1567
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1568
+ if (
1569
+ max_cache_length is not None
1570
+ and attention_mask is not None
1571
+ and cache_length + input_ids.shape[1] > max_cache_length
1572
+ ):
1573
+ attention_mask = attention_mask[:, -max_cache_length:]
1574
+
1575
+ position_ids = kwargs.get("position_ids", None)
1576
+ if attention_mask is not None and position_ids is None:
1577
+ # create position_ids on the fly for batch generation
1578
+ position_ids = attention_mask.long().cumsum(-1) - 1
1579
+ position_ids.masked_fill_(attention_mask == 0, 1)
1580
+ if past_key_values:
1581
+ position_ids = position_ids[:, -input_ids.shape[1]:]
1582
+
1583
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1584
+ if inputs_embeds is not None and past_key_values is None:
1585
+ model_inputs = {"inputs_embeds": inputs_embeds}
1586
+ else:
1587
+ model_inputs = {"input_ids": input_ids}
1588
+
1589
+ model_inputs.update(
1590
+ {
1591
+ "position_ids": position_ids,
1592
+ "past_key_values": past_key_values,
1593
+ "use_cache": kwargs.get("use_cache"),
1594
+ "attention_mask": attention_mask,
1595
+ }
1596
+ )
1597
+ return model_inputs
1598
+
1599
+ @staticmethod
1600
+ def _reorder_cache(past_key_values, beam_idx):
1601
+ reordered_past = ()
1602
+ for layer_past in past_key_values:
1603
+ reordered_past += (
1604
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1605
+ )
1606
+ return reordered_past
1607
+
1608
+
1609
+ @add_start_docstrings(
1610
+ """
1611
+ The HunYuan Model transformer with a sequence classification head on top (linear layer).
1612
+
1613
+ [`HunYuanForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1614
+ (e.g. GPT-2) do.
1615
+
1616
+ Since it does classification on the last token, it requires to know the position of the last token. If a
1617
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1618
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1619
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1620
+ each row of the batch).
1621
+ """,
1622
+ HUNYUAN_START_DOCSTRING,
1623
+ )
1624
+ class HunYuanForSequenceClassification(HunYuanPreTrainedModel):
1625
+ def __init__(self, config):
1626
+ super().__init__(config)
1627
+ self.num_labels = config.num_labels
1628
+ self.model = HunYuanModel(config)
1629
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1630
+
1631
+ # Initialize weights and apply final processing
1632
+ self.post_init()
1633
+
1634
+ def get_input_embeddings(self):
1635
+ return self.model.embed_tokens
1636
+
1637
+ def set_input_embeddings(self, value):
1638
+ self.model.embed_tokens = value
1639
+
1640
+ @add_start_docstrings_to_model_forward(HUNYUAN_INPUTS_DOCSTRING)
1641
+ def forward(
1642
+ self,
1643
+ input_ids: torch.LongTensor = None,
1644
+ attention_mask: Optional[torch.Tensor] = None,
1645
+ position_ids: Optional[torch.LongTensor] = None,
1646
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1647
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1648
+ labels: Optional[torch.LongTensor] = None,
1649
+ use_cache: Optional[bool] = None,
1650
+ output_attentions: Optional[bool] = None,
1651
+ output_hidden_states: Optional[bool] = None,
1652
+ return_dict: Optional[bool] = None,
1653
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1654
+ r"""
1655
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1656
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1657
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1658
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1659
+ """
1660
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1661
+
1662
+ transformer_outputs = self.model(
1663
+ input_ids,
1664
+ attention_mask=attention_mask,
1665
+ position_ids=position_ids,
1666
+ past_key_values=past_key_values,
1667
+ inputs_embeds=inputs_embeds,
1668
+ use_cache=use_cache,
1669
+ output_attentions=output_attentions,
1670
+ output_hidden_states=output_hidden_states,
1671
+ return_dict=return_dict,
1672
+ )
1673
+ hidden_states = transformer_outputs[0]
1674
+ logits = self.score(hidden_states)
1675
+
1676
+ if input_ids is not None:
1677
+ batch_size = input_ids.shape[0]
1678
+ else:
1679
+ batch_size = inputs_embeds.shape[0]
1680
+
1681
+ if self.config.pad_token_id is None and batch_size != 1:
1682
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1683
+ if self.config.pad_token_id is None:
1684
+ sequence_lengths = -1
1685
+ else:
1686
+ if input_ids is not None:
1687
+ sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
1688
+ logits.device
1689
+ )
1690
+ else:
1691
+ sequence_lengths = -1
1692
+
1693
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1694
+
1695
+ loss = None
1696
+ if labels is not None:
1697
+ labels = labels.to(logits.device)
1698
+ if self.config.problem_type is None:
1699
+ if self.num_labels == 1:
1700
+ self.config.problem_type = "regression"
1701
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1702
+ self.config.problem_type = "single_label_classification"
1703
+ else:
1704
+ self.config.problem_type = "multi_label_classification"
1705
+
1706
+ if self.config.problem_type == "regression":
1707
+ loss_fct = MSELoss()
1708
+ if self.num_labels == 1:
1709
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1710
+ else:
1711
+ loss = loss_fct(pooled_logits, labels)
1712
+ elif self.config.problem_type == "single_label_classification":
1713
+ loss_fct = CrossEntropyLoss()
1714
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1715
+ elif self.config.problem_type == "multi_label_classification":
1716
+ loss_fct = BCEWithLogitsLoss()
1717
+ loss = loss_fct(pooled_logits, labels)
1718
+ if not return_dict:
1719
+ output = (pooled_logits,) + transformer_outputs[1:]
1720
+ return ((loss,) + output) if loss is not None else output
1721
+
1722
+ return SequenceClassifierOutputWithPast(
1723
+ loss=loss,
1724
+ logits=pooled_logits,
1725
+ past_key_values=transformer_outputs.past_key_values,
1726
+ hidden_states=transformer_outputs.hidden_states,
1727
+ attentions=transformer_outputs.attentions,
1728
+ )
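
`HunYuanMoEV1ForCausalLM` above is the class that text generation goes through. A minimal end-to-end sketch, assuming this folder is loaded with `trust_remote_code=True`; the local path and the dtype/device settings are illustrative, not prescribed by the code:

# Minimal usage sketch for the causal LM defined above; the path and settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./"  # assumption: the directory containing this upload
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
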
tokenization_hy.py ADDED
@@ -0,0 +1,298 @@
1
+ import base64
2
+ import logging
3
+ import os
4
+ import unicodedata
5
+ from typing import Collection, Dict, List, Set, Tuple, Union
6
+
7
+ import tiktoken
8
+ from transformers import PreTrainedTokenizer, AddedToken
9
+
10
+ logger = logging.getLogger(__name__)
11
+
12
+
13
+ VOCAB_FILES_NAMES = {"vocab_file": "hy.tiktoken"}
14
+
15
+ PAT_STR = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
16
+ # PAT_STR = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
17
+ ENDOFTEXT = "<|endoftext|>"
18
+ STARTOFTEXT = "<|startoftext|>"
19
+ BOSTOKEN = "<|bos|>"
20
+ EOSTOKEN = "<|eos|>"
21
+ PADTOKEN = "<|pad|>"
22
+
23
+ # as the default behavior is changed to allow special tokens in
24
+ # regular texts, the surface forms of special tokens need to be
25
+ # as different as possible to minimize the impact
26
+ EXTRAS = tuple((f"<|extra_{i}|>" for i in range(205)))
27
+ # changed to use actual index to avoid misconfiguration with vocabulary expansion
28
+
29
+
30
+ SPECIAL_START_ID = 127957
31
+
32
+ def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
33
+ # with open(tiktoken_bpe_file, "rb", encoding="utf-8") as f:
34
+ # contents = f.read()
35
+ dic = {}
36
+ rank = 0
37
+ for line in open(tiktoken_bpe_file, "rb"):
38
+ if line:
39
+ token, _ = line.split()
40
+ if base64.b64decode(token) in dic:
41
+ continue
42
+ dic[base64.b64decode(token)] = int(rank)
43
+ rank += 1
44
+ global SPECIAL_START_ID
45
+ SPECIAL_START_ID = rank
46
+ return dic
47
+
48
+ # NOTE: Run the commented-out line below to verify that `SPECIAL_START_ID` is correct; it determines where the special-token IDs begin
49
+ # _load_tiktoken_bpe('/apdcephfs/share_1502809/shaneshu/tokenizer_exp/other_tokenizer_vocab/hy/' + VOCAB_FILES_NAMES['vocab_file'])
50
+ # print(SPECIAL_START_ID)
51
+
52
+ SPECIAL_TOKENS = tuple(
53
+ enumerate(
54
+ (
55
+ (
56
+ ENDOFTEXT,
57
+ STARTOFTEXT,
58
+ BOSTOKEN,
59
+ EOSTOKEN,
60
+ PADTOKEN,
61
+ )
62
+ + EXTRAS
63
+ ),
64
+ start=SPECIAL_START_ID,
65
+ )
66
+ )
67
+ # NOTE: Unused Token ID starts from 127962
68
+ SPECIAL_TOKENS_SET = set(t for i, t in SPECIAL_TOKENS)
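
Assuming the default `SPECIAL_START_ID` of 127957 (i.e. the tiktoken file contributes 127957 BPE ranks), the enumeration above yields the id layout sketched below; this is a worked example for reference, not additional code in the file:

# <|endoftext|>   -> 127957
# <|startoftext|> -> 127958
# <|bos|>         -> 127959
# <|eos|>         -> 127960
# <|pad|>         -> 127961
# <|extra_0|> ... <|extra_204|> -> 127962 ... 128166
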
69
+
70
+ class HYTokenizer(PreTrainedTokenizer):
71
+ """hunyuan tokenizer."""
72
+
73
+ vocab_files_names = VOCAB_FILES_NAMES
74
+
75
+ def __init__(
76
+ self,
77
+ vocab_file,
78
+ errors="replace",
79
+ extra_vocab_file=None,
80
+ **kwargs,
81
+ ):
82
+ super().__init__(**kwargs)
83
+
84
+ # how to handle errors in decoding UTF-8 byte sequences
85
+ # use errors="ignore" if you are doing streaming inference
86
+ self.errors = errors
87
+
88
+ self.mergeable_ranks = _load_tiktoken_bpe(vocab_file) # type: Dict[bytes, int]
89
+ self.special_tokens = {
90
+ token: index
91
+ for index, token in SPECIAL_TOKENS
92
+ }
93
+
94
+ # try load extra vocab from file
95
+ if extra_vocab_file is not None:
96
+ used_ids = set(self.mergeable_ranks.values()) | set(self.special_tokens.values())
97
+ extra_mergeable_ranks = _load_tiktoken_bpe(extra_vocab_file)
98
+ for token, index in extra_mergeable_ranks.items():
99
+ if token in self.mergeable_ranks:
100
+ logger.info(f"extra token {token} exists, skipping")
101
+ continue
102
+ if index in used_ids:
103
+ logger.info(f'the index {index} for extra token {token} exists, skipping')
104
+ continue
105
+ self.mergeable_ranks[token] = index
106
+ # the index may be sparse after this, but don't worry tiktoken.Encoding will handle this
107
+
108
+ enc = tiktoken.Encoding(
109
+ "HunYuan",
110
+ pat_str=PAT_STR,
111
+ mergeable_ranks=self.mergeable_ranks,
112
+ special_tokens=self.special_tokens,
113
+ )
114
+ assert (
115
+ len(self.mergeable_ranks) + len(self.special_tokens) == enc.n_vocab
116
+ ), f"{len(self.mergeable_ranks)} + {len(self.special_tokens)} != {enc.n_vocab} in encoding"
117
+
118
+ self.decoder = {
119
+ v: k for k, v in self.mergeable_ranks.items()
120
+ } # type: dict[int, bytes|str]
121
+ self.decoder.update({v: k for k, v in self.special_tokens.items()})
122
+
123
+ self.tokenizer = enc # type: tiktoken.Encoding
124
+
125
+ self.eod_id = self.tokenizer.eot_token
126
+ self.bod_id = self.special_tokens[STARTOFTEXT]
127
+ self.bos_id = self.special_tokens[BOSTOKEN]
128
+ self.eos_id = self.special_tokens[EOSTOKEN]
129
+ self.pad_id = self.special_tokens[PADTOKEN]
130
+
131
+ def __getstate__(self):
132
+ # for pickle lovers
133
+ state = self.__dict__.copy()
134
+ del state["tokenizer"]
135
+ return state
136
+
137
+ def __setstate__(self, state):
138
+ # tokenizer is not python native; don't pass it; rebuild it
139
+ self.__dict__.update(state)
140
+ enc = tiktoken.Encoding(
141
+ "HunYuan",
142
+ pat_str=PAT_STR,
143
+ mergeable_ranks=self.mergeable_ranks,
144
+ special_tokens=self.special_tokens,
145
+ )
146
+ self.tokenizer = enc
147
+
148
+ def __len__(self) -> int:
149
+ return self.tokenizer.n_vocab
150
+
151
+ def get_vocab(self) -> Dict[bytes, int]:
152
+ return self.mergeable_ranks
153
+
154
+ def convert_tokens_to_ids(
155
+ self, tokens: Union[bytes, str, List[Union[bytes, str]]]
156
+ ) -> List[int]:
157
+ ids = []
158
+ if isinstance(tokens, (str, bytes)):
159
+ if tokens in self.special_tokens:
160
+ return self.special_tokens[tokens]
161
+ else:
162
+ return self.mergeable_ranks.get(tokens)
163
+ for token in tokens:
164
+ if token in self.special_tokens:
165
+ ids.append(self.special_tokens[token])
166
+ else:
167
+ ids.append(self.mergeable_ranks.get(token))
168
+ return ids
169
+
170
+ def _add_tokens(
171
+ self,
172
+ new_tokens: Union[List[str], List[AddedToken]],
173
+ special_tokens: bool = False,
174
+ ) -> int:
175
+ if not special_tokens and new_tokens:
176
+ raise ValueError("Adding regular tokens is not supported")
177
+ for token in new_tokens:
178
+ surface_form = token.content if isinstance(token, AddedToken) else token
179
+ if surface_form not in SPECIAL_TOKENS_SET:
180
+ raise ValueError("Adding unknown special tokens is not supported")
181
+ return 0
182
+
183
+ def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
184
+ """
185
+ Save only the vocabulary of the tokenizer (the mergeable BPE ranks).
186
+ Returns:
187
+ `Tuple(str)`: Paths to the files saved.
188
+ """
189
+ file_path = os.path.join(save_directory, "hunyuan.tiktoken")
190
+ with open(file_path, "w", encoding="utf-8") as w:
191
+ for k, v in self.mergeable_ranks.items():
192
+ line = base64.b64encode(k).decode("utf-8") + " " + str(v) + "\n"
193
+ w.write(line)
194
+ return (file_path,)
195
+
196
+ def tokenize(
197
+ self,
198
+ text: str,
199
+ allowed_special: Union[Set, str] = "all",
200
+ disallowed_special: Union[Collection, str] = (),
201
+ **kwargs,
202
+ ) -> List[Union[bytes, str]]:
203
+ """
204
+ Converts a string into a sequence of tokens.
205
+ Args:
206
+ text (`str`):
207
+ The sequence to be encoded.
208
+ allowed_special (`Literal["all"]` or `set`):
209
+ The surface forms of the tokens to be encoded as special tokens in regular texts.
210
+ Default to "all".
211
+ disallowed_special (`Literal["all"]` or `Collection`):
212
+ The surface forms of the tokens that should not be in regular texts and trigger errors.
213
+ Default to an empty tuple.
214
+ kwargs (additional keyword arguments, *optional*):
215
+ Will be passed to the underlying model specific encode method.
216
+ Returns:
217
+ `List[bytes|str]`: The list of tokens.
218
+ """
219
+ tokens = []
220
+ text = unicodedata.normalize("NFC", text)
221
+
222
+ # this implementation takes a detour: text -> token id -> token surface forms
223
+ for t in self.tokenizer.encode(
224
+ text, allowed_special=allowed_special, disallowed_special=disallowed_special
225
+ ):
226
+ tokens.append(self.decoder[t])
227
+ return tokens
228
+
229
+ def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
230
+ """
231
+ Converts a sequence of tokens into a single string.
232
+ """
233
+ text = ""
234
+ temp = b""
235
+ for t in tokens:
236
+ if isinstance(t, str):
237
+ if temp:
238
+ text += temp.decode("utf-8", errors=self.errors)
239
+ temp = b""
240
+ text += t
241
+ elif isinstance(t, bytes):
242
+ temp += t
243
+ else:
244
+ raise TypeError("token should only be of type types or str")
245
+ if temp:
246
+ text += temp.decode("utf-8", errors=self.errors)
247
+ return text
248
+
249
+ @property
250
+ def vocab_size(self):
251
+ return self.tokenizer.n_vocab
252
+
253
+ def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
254
+ """Converts an id to a token, special tokens included"""
255
+ if index in self.decoder:
256
+ return self.decoder[index]
257
+ raise ValueError("unknown ids")
258
+
259
+ def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
260
+ """Converts a token to an id using the vocab, special tokens included"""
261
+ if token in self.special_tokens:
262
+ return self.special_tokens[token]
263
+ if token in self.mergeable_ranks:
264
+ return self.mergeable_ranks[token]
265
+ raise ValueError("unknown token")
266
+
267
+ def _tokenize(self, text: str, **kwargs):
268
+ """
269
+ Converts a string into a sequence of tokens (string), using the tokenizer. Splits into words for word-based
270
+ vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
271
+ Do NOT take care of added tokens.
272
+ """
273
+ raise NotImplementedError
274
+
275
+ def _decode(
276
+ self,
277
+ token_ids: Union[int, List[int]],
278
+ skip_special_tokens: bool = False,
279
+ errors: str = None,
280
+ **kwargs,
281
+ ) -> str:
282
+ if isinstance(token_ids, int):
283
+ token_ids = [token_ids]
284
+ if skip_special_tokens:
285
+ token_ids = [i for i in token_ids if i < self.eod_id]
286
+ return self.tokenizer.decode(token_ids, errors=errors or self.errors)
287
+
288
+ # tests
289
+ if __name__ == "__main__":
290
+ tokenizer = HYTokenizer.from_pretrained('./hy')
291
+ text = '你好,世界'
292
+ tokens = tokenizer.tokenize(text)
293
+ print(tokens)
294
+ ids = tokenizer.convert_tokens_to_ids(tokens)
295
+ print(ids)
296
+ text2 = tokenizer.convert_tokens_to_string(tokens)
297
+ print(text2)
298
+ ids2 = tokenizer.convert_tokens_to_ids(tokens)
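
The inline test above exercises the class directly; in normal use the tokenizer is resolved through the `auto_map` entry in tokenizer_config.json. A short sketch, assuming it is run from the repository root:

# Sketch: load via AutoTokenizer (resolved through tokenization_hy.HYTokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)  # assumption: repo root
ids = tokenizer.encode("你好,世界")
print(ids)
print(tokenizer.decode(ids))
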
tokenizer_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "architectures": [
3
+ "GPT2LMHeadModel"
4
+ ],
5
+ "model_max_length": 1048576,
6
+ "tokenizer_class": "HYTokenizer",
7
+ "auto_map": {
8
+ "AutoTokenizer": [
9
+ "tokenization_hy.HYTokenizer",
10
+ null
11
+ ]
12
+ },
13
+ "eos_token": "<|eos|>",
14
+ "model_type": "gpt2",
15
+ "additional_special_tokens": ["<|startoftext|>", "<|extra_0|>", "<|extra_4|>", "<|extra_5|>", "<|eos|>"],
16
+ "pad_token": "<|pad|>",
17
+ "chat_template": "{% set loop_messages = messages %}\n{% if tools %}\n {% set weekday_map = {'Monday': '星期一', 'Tuesday': '星期二', 'Wednesday': '星期三', 'Thursday': '星期四', 'Friday': '星期五', 'Saturday': '星期六', 'Sunday': '星期日'} %}\n {% set weekday_cn = weekday_map[strftime_now('%A')] %}\n {% set datetime_str = strftime_now('%Y-%m-%d %H:%M:%S') %}\n {% set datetime_str = datetime_str + ' ' + weekday_cn %}\n {% for message in loop_messages %}\n {% if 'content' in message %}\n {% set content = message['content'] %}\n {% else %}\n {% set content = '' %}\n {% endif %}\n {% if loop.index0 == 0 %}\n {% set content_tmp = '你是一位函数组合专家。你会得到一个问题和一组可能的函数。根据问题,你需要进行一个或多个函数/工具调用以实现目的。\n如果没有一个函数可以使用,请直接使用自然语言回复用户,以助手:开头。\n如果给定的问题缺少函数所需的参数,请使用自然语言进行提问,向用户询问必要信息,以助手:开头。\n如果调用结果已经足够回答用户问题,请对历史结果进行总结,使用自然语言回复用户,以助手:开头。\n你应该只在工具调用部分返回函数调用。如果你决定调用任何函数,你必须将其格式化为<tool_calls>[{\"name\": \"func_name1\", \"arguments\": {\"argument1\": \"value1\", \"argument2\": \"value2\"}},...]</tool_calls>。你不应该在回复中包含任何其他文本。以下是你可以调用的函数列表,格式为JSON。\n' %}\n {% set content_tmp = content_tmp + '\n' + tools | tojson + '\n' %}\n {% if message['role'] == 'system' %}\n {% set content_tmp = content_tmp + '\n额外要求:\n' + content + '\n\n如果你决定返回函数调用,请将其格式化为<tool_calls>[{\"name\": \"func_name1\", \"arguments\": {\"argument1\": \"value1\", \"argument2\": \"value2\"}},...]</tool_calls>,不得包含其他文本。如果额外要求里有格式要求,请忽略,以此处为准。\n否则,请参考开头说的三种情况,以助手:开头进行回复。\n\n如果额外要求里有时间信息,就以额外要求里的时间为准,否则,参考当前时间:' + datetime_str %}\n {% set content = '<|startoftext|>' + content_tmp + '<|extra_4|>' %}\n {% elif message['role'] == 'user' %}\n {% set content_tmp = content_tmp + '\n如果你决定返回函数调用,请将其格式化为<tool_calls>[{\"name\": \"func_name1\", \"arguments\": {\"argument1\": \"value1\", \"argument2\": \"value2\"}},...]</tool_calls>,不得包含其他文本。\n否则,请参考开头说的三种情况,以助手:开头进行回复。\n\n当前时间:' + datetime_str %}\n {% set content_tmp = '<|startoftext|>' + content_tmp + '<|extra_4|>'%}\n {% set content = content_tmp + '用户:' + content + '<|extra_0|>' %}\n {% endif %}\n {% else %}\n {% if message['role'] == 'user' %}\n {% set content = '用户:' + content + '<|extra_0|>' %}\n {% elif message['role'] == 'assistant' %}\n {% if 'tool_calls' in message %}\n {% set tool_calls = message['tool_calls'] %}\n {% set ns = namespace(tool_calls=\"[\") %}\n {% for tool_call in tool_calls %}\n {% set function = tool_call['function'] %}\n {% set name = function['name'] %}\n {% set ns.tool_calls = ns.tool_calls + '{\"name\": \"' + name + '\", '%}\n {% set arguments = function['arguments'] %}\n {% if arguments is not string %}\n {% set arguments = arguments | tojson %}\n {% endif %}\n {% set ns.tool_calls = ns.tool_calls + '\"arguments\": ' + arguments + '}' %}\n {% if not loop.last %}\n {% set ns.tool_calls = ns.tool_calls + ', '%}\n {% endif %}\n {% endfor %}\n {% set ns.tool_calls = ns.tool_calls + ']' %}\n {% set content = content + '<tool_calls>' + ns.tool_calls + '</tool_calls>' %}\n {% else %}\n {% set content = '助手:' + content %}\n {% endif %}\n {% set content = content + '<|eos|>' %}\n {% elif message['role'] == 'tool' %}\n {% if content is not string %}\n {set content = content | tojson }\n {% endif %}\n {% set content = '<tool_response>' + content + '</tool_response>' %}\n {% set content = content + '<|extra_0|>' %}\n {% endif %}\n {% endif %}\n {{- content -}}\n {% endfor %}\n{% else %}\n {% set context = {'has_head': true} %}\n {% for message in loop_messages %}\n {% if 'content' in message %}\n {% set content = message['content'] %}\n {% else %}\n {% set content = '' %}\n {% endif %}\n {% if loop.index0 == 0 %}\n {% if 
content == '' %}\n {% set _ = context.update({'has_head': false}) %}\n {% elif message['role'] == 'system' %}\n {% set content = '<|startoftext|>' + content + '<|extra_4|>' %}\n {% endif %}\n {% endif %}\n {% if message['role'] == 'user' %}\n {% if loop.index0 == 1 and not context.has_head %}\n {% set content = '<|startoftext|>' + content %}\n {% endif %}\n {% if loop.index0 == 1 and context.has_head %}\n {% set content = content + '<|extra_0|>' %}\n {% else %}\n {% set content = '<|startoftext|>' + content + '<|extra_0|>' %}\n {% endif %}\n {% elif message['role'] == 'assistant' %}\n {% set content = content + '<|eos|>' %}\n {% elif message['role'] == 'tool' %}\n {% set content = content + '<|extra_0|>' %}\n {% endif %}\n {{- content -}}\n {% endfor %}\n{% endif %}\n{%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n' }}\n{%- endif %}"
18
+ }
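
The `chat_template` field above is what `apply_chat_template` renders. A minimal sketch of building a prompt with thinking disabled; tokenizer loading and the path follow the earlier sketches, and the message contents are illustrative:

# Sketch: render the chat template defined in tokenizer_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)  # assumption: repo root
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short poem about spring."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False,  # appends the empty <think> block per the template above
)
print(prompt)
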