ZeroWw committed on
Commit ee43ec6 · verified · 1 Parent(s): 1549a58

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ Hunyuan-7B-Instruct.f16.gguf filter=lfs diff=lfs merge=lfs -text
37
+ Hunyuan-7B-Instruct.q5_k.gguf filter=lfs diff=lfs merge=lfs -text
38
+ Hunyuan-7B-Instruct.q6_k.gguf filter=lfs diff=lfs merge=lfs -text
39
+ Hunyuan-7B-Instruct.q8_0.gguf filter=lfs diff=lfs merge=lfs -text
40
+ Hunyuan-7B-Instruct.q8_p.gguf filter=lfs diff=lfs merge=lfs -text
41
+ Hunyuan-7B-Instruct.q8q4.gguf filter=lfs diff=lfs merge=lfs -text
Hunyuan-7B-Instruct.f16.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:13dc3c64e876da0c8513ab6f288fc20f6db66455c158d25bcc7743f52da80929
3
+ size 15014548224
Hunyuan-7B-Instruct.q5_k.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9efb84d9865b132dc0049fda38745b2d2aed51163dce14d08922d8588eec523e
3
+ size 5987881728
Hunyuan-7B-Instruct.q6_k.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a089a5834d1ec1952e48dfb968af093beb1e7d760b25996beb72af1e6f4c7cf6
3
+ size 6781129472
Hunyuan-7B-Instruct.q8_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a3034c1c68fff8afee961f0f00836d555ff04c84375c8ec4fdad9f2417d05fc8
3
+ size 8471433984
Hunyuan-7B-Instruct.q8_p.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:51cacc5de4ccc03ee3c6106e2d3812c5288e115450549663a236cead67210ab1
3
+ size 7979272704
Hunyuan-7B-Instruct.q8q4.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c83fe4bc8a84fe19bf4065157ed6a44124ca43ba65537e6e62a2cb828d3528c0
3
+ size 4749134336
Hunyuan-7B-Instruct/README.md ADDED
@@ -0,0 +1,507 @@
1
+ ---
2
+ base_model:
3
+ - tencent/Hunyuan-7B-Pretrain
4
+ library_name: transformers
5
+ ---
6
+
7
+
8
+
9
+ <p align="center">
10
+ <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
11
+ </p><p></p>
12
+
13
+
14
+ <p align="center">
15
+ 🤗&nbsp;<a href="https://huggingface.co/tencent/"><b>HuggingFace</b></a>&nbsp;|&nbsp;
16
+ 🤖&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct"><b>ModelScope</b></a>&nbsp;|&nbsp;
17
+ 🪡&nbsp;<a href="https://github.com/Tencent/AngelSlim/tree/main"><b>AngelSlim</b></a>
18
+ </p>
19
+
20
+ <p align="center">
21
+ 🖥️&nbsp;<a href="https://hunyuan.tencent.com" style="color: red;"><b>Official Website</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
22
+ 🕖&nbsp;<a href="https://cloud.tencent.com/product/hunyuan"><b>HunyuanAPI</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
23
+ 🕹️&nbsp;<a href="https://hunyuan.tencent.com/"><b>Demo</b></a>&nbsp;&nbsp;&nbsp;&nbsp;
24
+ </p>
25
+
26
+ <p align="center">
27
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B"><b>GITHUB</b></a> |
28
+ <a href="https://cnb.cool/tencent/hunyuan/Hunyuan-7B"><b>cnb.cool</b></a> |
29
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B/blob/main/LICENSE"><b>LICENSE</b></a> |
30
+ <a href="https://raw.githubusercontent.com/Tencent-Hunyuan/Hunyuan-A13B/main/assets/1751881231452.jpg"><b>WeChat</b></a> |
31
+ <a href="https://discord.gg/bsPcMEtV7v"><b>Discord</b></a>
32
+ </p>
33
+
34
+
35
+ ## Model Introduction
36
+
37
+ Hunyuan is Tencent's open-source efficient large language model series, designed for versatile deployment across diverse computational environments. From edge devices to high-concurrency production systems, these models deliver optimal performance with advanced quantization support and ultra-long context capabilities.
38
+
39
+ We have released a series of Hunyuan dense models, comprising both pre-trained and instruction-tuned variants, with parameter scales of 0.5B, 1.8B, 4B, and 7B. These models adopt training strategies similar to the Hunyuan-A13B, thereby inheriting its robust performance characteristics. This comprehensive model family enables flexible deployment optimization - from resource-constrained edge computing with smaller variants to high-throughput production environments with larger models, all while maintaining strong capabilities across diverse scenarios.
40
+
41
+ ### Key Features and Advantages
42
+
43
+ - **Hybrid Reasoning Support**: Supports both fast and slow thinking modes, allowing users to flexibly choose according to their needs.
44
+ - **Ultra-Long Context Understanding**: Natively supports a 256K context window, maintaining stable performance on long-text tasks.
45
+ - **Enhanced Agent Capabilities**: Optimized for agent tasks, achieving leading results on benchmarks such as BFCL-v3, τ-Bench and C3-Bench.
46
+ - **Efficient Inference**: Utilizes Grouped Query Attention (GQA) and supports multiple quantization formats, enabling highly efficient inference.
47
+
48
+ ## Related News
49
+ * 2025.7.30 We have open-sourced **Hunyuan-0.5B-Pretrain**, **Hunyuan-0.5B-Instruct**, **Hunyuan-1.8B-Pretrain**, **Hunyuan-1.8B-Instruct**, **Hunyuan-4B-Pretrain**, **Hunyuan-4B-Instruct**, **Hunyuan-7B-Pretrain**, and **Hunyuan-7B-Instruct** on Hugging Face.
50
+ <br>
51
+
52
+
53
+ ## Benchmark
54
+
55
+ Note: The following benchmarks were evaluated with the TRT-LLM backend on several **base models**.
56
+
57
+ | Model | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain|
58
+ |:------------------:|:---------------:|:--------------:|:-------------:|:---------------:|
59
+ | MMLU | 54.02 | 64.62 | 74.01 | 79.82 |
60
+ | MMLU-Redux | 54.72 | 64.42 | 73.53 | 79 |
61
+ | MMLU-Pro | 31.15 | 38.65 | 51.91 | 57.79 |
62
+ | SuperGPQA | 17.23 | 24.98 | 27.28 | 30.47 |
63
+ | BBH | 45.92 | 74.32 | 75.17 | 82.95 |
64
+ | GPQA | 27.76 | 35.81 | 43.52 | 44.07 |
65
+ | GSM8K | 55.64 | 77.26 | 87.49 | 88.25 |
66
+ | MATH | 42.95 | 62.85 | 72.25 | 74.85 |
67
+ | EvalPlus | 39.71 | 60.67 | 67.76 | 66.96 |
68
+ | MultiPL-E | 21.83 | 45.92 | 59.87 | 60.41 |
69
+ | MBPP | 43.38 | 66.14 | 76.46 | 76.19 |
70
+ | CRUX-O | 30.75 | 36.88 | 56.5 | 60.75 |
71
+ | Chinese SimpleQA | 12.51 | 22.31 | 30.53 | 38.86 |
72
+ | simpleQA (5shot) | 2.38 | 3.61 | 4.21 | 5.69 |
73
+
74
+
75
+ | Topic | Bench | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct|
76
+ |:-------------------:|:----------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
77
+ | **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 17.2<br>20<br>48.5 | 56.7<br>53.9<br>86 | 78.3<br>66.5<br>92.6 | 81.1<br>75.3<br>93.7 |
78
+ | **Science** | GPQA-Diamond<br>OlympiadBench | 23.3<br>29.6 | 47.2<br>63.4 | 61.1<br>73.1 | 60.1<br>76.5 |
79
+ | **Coding** | Livecodebench<br>Fullstackbench | 11.1<br>20.9 | 31.5<br>42 | 49.4<br>54.6 | 57<br>56.3 |
80
+ | **Reasoning** | BBH<br>DROP<br>ZebraLogic | 40.3<br>52.8<br>34.5 | 64.6<br>76.7<br>74.6 | 83<br>78.2<br>83.5 | 87.8<br>85.9<br>85.1 |
81
+ | **Instruction<br>Following** | IF-Eval<br>SysBench | 49.7<br>28.1 | 67.6<br>55.5 | 76.6<br>68 | 79.3<br>72.7 |
82
+ | **Agent** | BFCL v3<br> τ-Bench<br>ComplexFuncBench<br> C3-Bench | 49.8<br>14.4<br>13.9<br>45.3 | 58.3<br>18.2<br>22.3<br>54.6 | 67.9<br>30.1<br>26.3<br>64.3 | 70.8<br>35.3<br>29.2<br>68.5 |
83
+ | **Long<br>Context** | PenguinScrolls<br>longbench-v2<br>FRAMES | 53.9<br>34.7<br>41.9 | 73.1<br>33.2<br>55.6 | 83.1<br>44.1<br>79.2 | 82<br>43<br>78.6 |
84
+
85
+
86
+ &nbsp;
87
+
88
+ ### Use with transformers
89
+ First, please install the companion transformers branch below; Hunyuan support will be merged into the main branch later.
90
+ ```SHELL
91
+ pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
92
+ ```
93
+ Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning.
94
+ 1. Pass **"enable_thinking=False"** when calling apply_chat_template.
95
+ 2. Adding **"/no_think"** before the prompt will force the model not to perform CoT reasoning. Similarly, adding **"/think"** before the prompt will force the model to perform CoT reasoning.
96
+
97
+ The following code snippet shows how to use the transformers library to load and use the model. It also demonstrates how to enable and disable the reasoning mode, and how to parse the reasoning process along with the final output.
98
+
99
+ We use tencent/Hunyuan-7B-Instruct as an example.
100
+
101
+ ```python
102
+ from transformers import AutoModelForCausalLM, AutoTokenizer
103
+ import os
104
+ import re
105
+
106
+ model_name_or_path = "tencent/Hunyuan-7B-Instruct"
107
+
108
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
109
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto") # You may want to use bfloat16 and/or move to GPU here
110
+ messages = [
111
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
112
+ ]
113
+ tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
114
+ enable_thinking=True # Toggle thinking mode (default: True)
115
+ )
116
+
117
+ outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
118
+
119
+ output_text = tokenizer.decode(outputs[0])
120
+ print("output_text=",output_text)
121
+ think_pattern = r'<think>(.*?)</think>'
122
+ think_matches = re.findall(think_pattern, output_text, re.DOTALL)
123
+
124
+ answer_pattern = r'<answer>(.*?)</answer>'
125
+ answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
126
+
127
+ think_content = [match.strip() for match in think_matches][0]
128
+ answer_content = [match.strip() for match in answer_matches][0]
129
+ print(f"thinking_content:{think_content}\n\n")
130
+ print(f"answer_content:{answer_content}\n\n")
131
+
132
+
133
+ ```
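
As a quick variation on the example above, CoT reasoning can be switched off either by passing `enable_thinking=False` to `apply_chat_template` or by prefixing the prompt with `/no_think`. A minimal sketch, reusing the `model` and `tokenizer` loaded above:

```python
# Fast-thinking (no CoT) request; either the "/no_think" prefix or
# enable_thinking=False is sufficient on its own.
messages = [
    {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
    enable_thinking=False,
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
print(tokenizer.decode(outputs[0]))
```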
134
+
135
+ We recommend using the following set of parameters for inference. Note that our model does not use a default system prompt.
136
+
137
+ ```json
138
+
139
+ {
140
+ "do_sample": true,
141
+ "top_k": 20,
142
+ "top_p": 0.8,
143
+ "repetition_penalty": 1.05,
144
+ "temperature": 0.7
145
+ }
146
+ ```
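
These values map directly onto `generate` keyword arguments; a minimal sketch reusing the inputs prepared in the example above:

```python
# Generate with the recommended sampling parameters.
outputs = model.generate(
    tokenized_chat.to(model.device),
    max_new_tokens=2048,
    do_sample=True,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.05,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0]))
```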
147
+
148
+ &nbsp;
149
+
150
+ ### Training Data Format
151
+
152
+ If you need to fine-tune our Instruct model, we recommend processing the data into the following format, corresponding to both slow-thinking and fast-thinking scenarios.
153
+
154
+ ```python
155
+ # think_pattern
156
+ think = ""
157
+ answer = ""
158
+ think_pattern = f"<think>\n{think}\n</think>\n<answer>\n{answer}\n</answer>"
159
+
160
+ # fast think pattern
161
+ messages = [
162
+ {"role": "system", "content": "You are a helpful assistant."},
163
+ {"role": "user", "content": "/no_think海水为什么是咸的" },
164
+ {"role": "assistant", "content": "<think>\n\n</think>\n<answer>\n海水是咸的主要是因为其中含有许多溶解在水中的盐类和矿物质。这些盐类和矿物质来自于地球表面的岩石和土壤中的化学物质,随着时间的推移,它们被带到了海洋中。当海水蒸发时,水分蒸发掉了,但盐类和矿物质仍然留在水中,导致海水变得更加咸味。因此,海水的咸度是由其中的盐类和矿物质的含量决定的。\n</answer>"}
165
+ ]
166
+
167
+ # slow think pattern
168
+ messages = [
169
+ {"role": "system", "content": "You are a helpful assistant."},
170
+ {"role": "user", "content": "1+1=" },
171
+ {"role": "assistant", "content": "<think>\n嗯,用户问的是1加1等于多少。首先,我需要确认这是一个基本的算术问题。1加1在十进制的数学体系中,通常的结果是2。不过,可能需要考虑是否有其他情况,比如二进制或者其他数制,但用户没有特别说明,所以默认应该是十进制。另外,有时候可能会有脑筋急转弯的情况,比如在某些语境下1+1可能等于1(比如1滴水加1滴水还是1滴水),但通常数学问题中都是2。所以最准确的回答应该是2。</think>\n<answer>\n在十进制的基本算术运算中,1加1的结果是2。这是数学中最基础的加法运算之一,遵循自然数的加法规则。因此,1 + 1 = 2。\n</answer>"}
172
+ ]
173
+
174
+ from transformers import AutoTokenizer
175
+ tokenizer = AutoTokenizer.from_pretrained("your_tokenizer_path", trust_remote_code=True)
176
+ train_ids = tokenizer.apply_chat_template(messages)
177
+ ```
178
+
179
+ &nbsp;
180
+
181
+ ### Train with LLaMA-Factory
182
+
183
+ In this section, we introduce how to use `LLaMA-Factory` to fine-tune the `Hunyuan` model.
184
+
185
+ #### Prerequisites
186
+
187
+ Verify installation of the following dependencies:
188
+ - **LLaMA-Factory**: Follow [official installation guide](https://github.com/hiyouga/LLaMA-Factory)
189
+ - **DeepSpeed** (optional): Follow [official installation guide](https://github.com/deepspeedai/DeepSpeed#installation)
190
+ - **Transformers library**: Use the companion branch below (the Hunyuan-submitted code is pending review)
191
+ ```
192
+ pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
193
+ ```
194
+
195
+ #### Data preparation
196
+
197
+ We need to prepare a custom dataset:
198
+ 1. Organize your data in `json` format and place it in the `data` directory in `LLaMA-Factory`. The current implementation uses the `sharegpt` dataset format, which requires the following structure:
199
+ ```
200
+ [
201
+ {
202
+ "messages": [
203
+ {
204
+ "role": "system",
205
+ "content": "System prompt (optional)"
206
+ },
207
+ {
208
+ "role": "user",
209
+ "content": "Human instruction"
210
+ },
211
+ {
212
+ "role": "assistant",
213
+ "content": "Model response"
214
+ }
215
+ ]
216
+ }
217
+ ]
218
+ ```
219
+ Refer to the [Data Format](#training-data-format) section mentioned earlier for details.
220
+
221
+ 2. Define your dataset in the data/dataset_info.json file using the following format:
222
+ ```
223
+ "dataset_name": {
224
+ "file_name": "dataset.json",
225
+ "formatting": "sharegpt",
226
+ "columns": {
227
+ "messages": "messages"
228
+ },
229
+ "tags": {
230
+ "role_tag": "role",
231
+ "content_tag": "content",
232
+ "user_tag": "user",
233
+ "assistant_tag": "assistant",
234
+ "system_tag": "system"
235
+ }
236
+ }
237
+ ```
238
+
239
+ #### Training execution
240
+
241
+ 1. Copy all files from the `train/llama_factory_support/example_configs` directory to the `example/hunyuan` directory in `LLaMA-Factory`.
242
+ 2. Modify the model path and dataset name in the configuration file `hunyuan_full.yaml`. Adjust other configurations as needed:
243
+ ```
244
+ ### model
245
+ model_name_or_path: [!!!add the model path here!!!]
246
+
247
+ ### dataset
248
+ dataset: [!!!add the dataset name here!!!]
249
+ ```
250
+ 3. Execute training commands:
251
+ * Single-node training
252
+ Note: Set the environment variable DISABLE_VERSION_CHECK to 1 to avoid version conflicts.
253
+ ```
254
+ export DISABLE_VERSION_CHECK=1
255
+ llamafactory-cli train examples/hunyuan/hunyuan_full.yaml
256
+ ```
257
+ * Multi-node training
258
+ Execute the following command on each node. Configure NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT according to your environment:
259
+ ```
260
+ export DISABLE_VERSION_CHECK=1
261
+ FORCE_TORCHRUN=1 NNODES=${NNODES} NODE_RANK=${NODE_RANK} MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT} \
262
+ llamafactory-cli train examples/hunyuan/hunyuan_full.yaml
263
+ ```
264
+
265
+ &nbsp;
266
+
267
+
268
+ ## Quantization Compression
269
+ We used our own [AngelSlim](https://github.com/tencent/AngelSlim) compression tool to produce FP8 and INT4 quantized models. `AngelSlim` is a toolset dedicated to creating a more user-friendly, comprehensive and efficient model compression solution.
270
+
271
+ ### FP8 Quantization
272
+ We use FP8 static quantization. FP8 quantization uses an 8-bit floating-point format and a small amount of calibration data (no training required) to pre-determine the quantization scales; model weights and activation values are then converted to FP8 format, improving inference efficiency and lowering the deployment threshold. You can quantize the model yourself with AngelSlim, or directly download and use our released quantized models [LINK](https://huggingface.co/).
273
+
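
To illustrate the static-calibration idea only (this is not the AngelSlim implementation; the calibration tensor and scaling rule below are toy assumptions), the scale is fixed once from calibration data and reused at inference time, with 448 being the largest value representable in the FP8 e4m3 format:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 format

rng = np.random.default_rng(0)
calib_acts = rng.normal(scale=3.0, size=4096).astype(np.float32)  # stand-in calibration activations

# Static scale: determined once from calibration data, reused at inference time.
scale = np.abs(calib_acts).max() / FP8_E4M3_MAX

def fake_quant_fp8(x: np.ndarray, scale: float) -> np.ndarray:
    # Simulate the FP8 conversion by clamping to the representable range;
    # real FP8 kernels also round the mantissa to e4m3 precision.
    return np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX) * scale

x = rng.normal(scale=3.0, size=8).astype(np.float32)
print(fake_quant_fp8(x, scale))
```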
274
+ ### Int4 Quantization
275
+ We use the GPTQ and AWQ algorithms to achieve W4A16 quantization.
276
+
277
+ GPTQ processes the model weights layer by layer, using a small amount of calibration data to minimize the reconstruction error of the quantized weights; the weights are adjusted layer by layer through an optimization process based on an approximation of the inverse Hessian. The process eliminates the need to retrain the model and requires only a small amount of calibration data to quantize the weights, improving inference efficiency and lowering the deployment threshold.
278
+ AWQ uses a small amount of calibration data (no training required) to estimate the magnitude of the activation values. For each weight channel, a scaling coefficient s is computed to expand the numerical range of important weights, allowing more information to be retained during quantization.
279
+
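
A toy numpy sketch of the channel-scaling idea described above (illustrative only, not the AngelSlim implementation; the shapes and the square-root scaling rule are assumptions): important input channels are scaled up in the weights and down in the activations, which leaves the product unchanged but preserves more information when the weights are quantized to 4 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64)).astype(np.float32)   # weight [out_features, in_features]
X = rng.normal(size=(1024, 64)).astype(np.float32)  # calibration activations [tokens, in_features]

# Per-input-channel activation magnitude, the "importance" signal.
act_mag = np.abs(X).mean(axis=0)
s = np.sqrt(act_mag) / np.sqrt(act_mag).mean()       # toy scaling rule; AWQ searches this trade-off

# Scale weights up on important channels and activations down by the same factor.
W_scaled = W * s
X_scaled = X / s

# Symmetric 4-bit quantization of the scaled weights, per output channel.
w_scale = np.abs(W_scaled).max(axis=1, keepdims=True) / 7.0
W_q = np.clip(np.round(W_scaled / w_scale), -8, 7)

# The dequantized matmul matches the original computation up to quantization error.
err = np.abs(X @ W.T - X_scaled @ (W_q * w_scale).T).mean()
print("mean abs error:", err)
```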
280
+ You can quantize the model with [AngelSlim](https://github.com/tencent/AngelSlim), or directly download and use our released quantized models [LINK](https://huggingface.co/).
281
+
282
+
283
+
284
+ #### Quantization Benchmark
285
+ This subsection reports benchmark results for the quantized Hunyuan models.
286
+
287
+ | Bench | Quantization | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct |
288
+ |:-------------:|:---------------------------------:|:----------------------------:|:------------------------------:|:----------------------------:|:----------------------------:|
289
+ | DROP | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 52.8<br>51.6<br>50.9<br>48.9 | 76.7<br>75.1<br>73.0<br>71.7 | 78.2<br>78.3<br>78.1<br>78.2 | 85.9<br>86.0<br>85.7<br>85.9 |
290
+ | GPQA-Diamond | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 23.3<br>22.5<br>23.3<br>23.3 | 47.2<br>47.7<br>44.43<br>43.62 | 61.1<br>60.2<br>58.1<br>- | 60.1<br>60.1<br>60.0<br>60.1 |
291
+ | OlympiadBench | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 29.6<br>29.6<br>26.8<br>26.3 | 63.4<br>62.5<br>60.9<br>61.7 | 73.1<br>73.1<br>71.1<br>71.2 | 76.5<br>76.6<br>76.2<br>76.4 |
292
+ | AIME 2024 | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 17.2<br>17.2<br>-<br>- | 56.7<br>55.17<br>-<br>- | 78.3<br>76.6<br>-<br>- | 81.1<br>80.9<br>81.0<br>80.9 |
293
+
294
+
295
+ ## Deployment
296
+
297
+ For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.
298
+
299
+ Docker images: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags
300
+
301
+
302
+ ### TensorRT-LLM
303
+
304
+ #### Docker Image
305
+
306
+ We provide a pre-built Docker image based on the latest version of TensorRT-LLM.
307
+
308
+ We use tencent/Hunyuan-7B-Instruct as an example.
309
+ - To get started:
310
+
311
+ https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
312
+
313
+ ```
314
+ docker pull hunyuaninfer/hunyuan-7B:hunyuan-moe-7B-trtllm
315
+ ```
316
+ ```
317
+ docker run --privileged --user root --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-7B:hunyuan-moe-7B-trtllm
318
+ ```
319
+
320
+ - Prepare Configuration file:
321
+
322
+ ```
323
+ cat >/path/to/extra-llm-api-config.yml <<EOF
324
+ use_cuda_graph: true
325
+ cuda_graph_padding_enabled: true
326
+ cuda_graph_batch_sizes:
327
+ - 1
328
+ - 2
329
+ - 4
330
+ - 8
331
+ - 16
332
+ - 32
333
+ print_iter_log: true
334
+ EOF
335
+ ```
336
+
337
+
338
+ - Start the API server:
339
+
340
+
341
+ ```
342
+ trtllm-serve \
343
+ /path/to/HunYuan-moe-7B \
344
+ --host localhost \
345
+ --port 8000 \
346
+ --backend pytorch \
347
+ --max_batch_size 32 \
348
+ --max_num_tokens 16384 \
349
+ --tp_size 2 \
350
+ --kv_cache_free_gpu_memory_fraction 0.6 \
351
+ --trust_remote_code \
352
+ --extra_llm_api_options /path/to/extra-llm-api-config.yml
353
+ ```
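
Once the server is up, it exposes an OpenAI-compatible endpoint on port 8000. A minimal Python client sketch (the served model name below mirrors the path passed to `trtllm-serve` and may need adjusting for your deployment):

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local trtllm-serve endpoint started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/HunYuan-moe-7B",  # adjust to the name your server reports
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print(response.choices[0].message.content)
```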
354
+
355
+
356
+ ### vLLM
357
+
358
+ #### Start
359
+ Please use vLLM version v0.10.0 or higher for inference.
360
+
361
+ We use tencent/Hunyuan-7B-Instruct as an example.
362
+ - Download Model file:
363
+ - Hugging Face: downloaded automatically by vLLM.
364
+ - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-7B-Instruct`
365
+
366
+ - Model downloaded from Hugging Face:
367
+ ```shell
368
+ export MODEL_PATH=tencent/Hunyuan-7B-Instruct
369
+ ```
370
+
371
+ - Model downloaded from ModelScope:
372
+ ```shell
373
+ export MODEL_PATH=/root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-7B-Instruct/
374
+ ```
375
+
376
+ - Start the API server:
377
+
378
+ ```shell
379
+ python3 -m vllm.entrypoints.openai.api_server \
380
+ --host 0.0.0.0 \
381
+ --port 8000 \
382
+ --trust-remote-code \
383
+ --model ${MODEL_PATH} \
384
+ --tensor-parallel-size 1 \
385
+ --dtype bfloat16 \
386
+ --quantization experts_int8 \
387
+ --served-model-name hunyuan \
388
+ 2>&1 | tee log_server.txt
389
+ ```
390
+ - After the service script runs successfully, run the request script:
391
+ ```shell
392
+ curl http://0.0.0.0:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
393
+ "model": "hunyuan",
394
+ "messages": [
395
+ {
396
+ "role": "system",
397
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
398
+ },
399
+ {
400
+ "role": "user",
401
+ "content": [{"type": "text", "text": "请按面积大小对四大洋进行排序,并给出面积最小的洋是哪一个?直接输出结果。"}]
402
+ }
403
+ ],
404
+ "max_tokens": 2048,
405
+ "temperature":0.7,
406
+ "top_p": 0.6,
407
+ "top_k": 20,
408
+ "repetition_penalty": 1.05,
409
+ "stop_token_ids": [127960]
410
+ }'
411
+ ```
412
+ #### Quantized model deployment
413
+ This section describes the process of deploying a post-quantization model using vLLM.
414
+
415
+ The default server runs in BF16.
416
+
417
+ ##### Int8 quantized model deployment
418
+ Deploying the Int8-weight-only version of the HunYuan-7B model only requires setting the `MODEL_PATH` environment variable as above.
419
+
420
+ Next we start the Int8 service. Run:
421
+ ```shell
422
+ python3 -m vllm.entrypoints.openai.api_server \
423
+ --host 0.0.0.0 \
424
+ --port 8000 \
425
+ --trust-remote-code \
426
+ --model ${MODEL_PATH} \
427
+ --tensor-parallel-size 1 \
428
+ --dtype bfloat16 \
429
+ --served-model-name hunyuan \
430
+ --quantization experts_int8 \
431
+ 2>&1 | tee log_server.txt
432
+ ```
433
+
434
+
435
+ ##### Int4 quantized model deployment
436
+ Deploying the Int4-weight-only (GPTQ) version of the HunYuan-7B model only requires setting the environment variable:
437
+ ```shell
438
+ export MODEL_PATH=PATH_TO_INT4_MODEL
439
+ ```
440
+ Next we start the Int4 service. Run
441
+ ```shell
442
+ python3 -m vllm.entrypoints.openai.api_server \
443
+ --host 0.0.0.0 \
444
+ --port 8000 \
445
+ --trust-remote-code \
446
+ --model ${MODEL_PATH} \
447
+ --tensor-parallel-size 1 \
448
+ --dtype bfloat16 \
449
+ --served-model-name hunyuan \
450
+ --quantization gptq_marlin \
451
+ 2>&1 | tee log_server.txt
452
+ ```
453
+
454
+ ##### FP8 quantized model deployment
455
+ Deploying the W8A8C8 version of the HunYuan-7B model only requires setting the `MODEL_PATH` environment variable.
456
+
457
+
458
+ Next we start the FP8 service. Run
459
+ ```shell
460
+ python3 -m vllm.entrypoints.openai.api_server \
461
+ --host 0.0.0.0 \
462
+ --port 8000 \
463
+ --trust-remote-code \
464
+ --model ${MODEL_PATH} \
465
+ --tensor-parallel-size 1 \
466
+ --dtype bfloat16 \
467
+ --served-model-name hunyuan \
468
+ --kv-cache-dtype fp8 \
469
+ 2>&1 | tee log_server.txt
470
+ ```
471
+
472
+
473
+
474
+
475
+ ### SGLang
476
+
477
+ #### Docker Image
478
+
479
+ We also provide a pre-built Docker image based on the latest version of SGLang.
480
+
481
+ We use tencent/Hunyuan-7B-Instruct for example
482
+
483
+ To get started:
484
+
485
+ - Pull the Docker image
486
+
487
+ ```
488
+ docker pull lmsysorg/sglang:latest
489
+ ```
490
+
491
+ - Start the API server:
492
+
493
+ ```
494
+ docker run --entrypoint="python3" --gpus all \
495
+ --shm-size 32g \
496
+ -p 30000:30000 \
497
+ --ulimit nproc=10000 \
498
+ --privileged \
499
+ --ipc=host \
500
+ lmsysorg/sglang:latest \
501
+ -m sglang.launch_server --model-path hunyuan/huanyuan_7B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
502
+ ```
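
After the container starts, the server exposes an OpenAI-compatible API on port 30000. A minimal client sketch (the served model name `default` is an assumption):

```python
import openai

# Query the SGLang server started above through its OpenAI-compatible endpoint.
client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
    temperature=0.7,
    max_tokens=1024,
    extra_body={"top_p": 0.8, "top_k": 20},
)
print(response)
```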
503
+
504
+
505
+ ## Contact Us
506
+
507
+ If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also contact us via email ([email protected]).
Hunyuan-7B-Instruct/README_CN.md ADDED
@@ -0,0 +1,748 @@
1
+ <p align="left">
2
+ <a href="README.md">English</a> | 中文&nbsp;
3
+ </p>
4
+ <br><br>
5
+
6
+ <p align="center">
7
+ <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
8
+ </p><p></p>
9
+
10
+
11
+ <p align="center">
12
+ 🤗&nbsp;<a href="https://huggingface.co/tencent/"><b>Hugging Face</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
13
+ <img src="https://avatars.githubusercontent.com/u/109945100?s=200&v=4" width="16"/>&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/"><b>ModelScope</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
14
+ <img src="https://cdn-avatars.huggingface.co/v1/production/uploads/6594d0c6c5f1cd69a48b261d/04ZNQlAfs08Bfg4B1o3XO.png" width="14"/>&nbsp;<a href="https://github.com/Tencent/AngelSlim/tree/main"><b>AngelSlim</b></a>
15
+ </p>
16
+
17
+ <p align="center">
18
+ 🖥️&nbsp;<a href="https://hunyuan.tencent.com" style="color: red;"><b>Official Website</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
19
+ 🕖&nbsp;<a href="https://cloud.tencent.com/product/hunyuan"><b>HunyuanAPI</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
20
+ 🕹️&nbsp;<a href="https://hunyuan.tencent.com/"><b>Demo</b></a>&nbsp;&nbsp;&nbsp;&nbsp;
21
+ </p>
22
+
23
+ <p align="center">
24
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B"><b>GITHUB</b></a> |
25
+ <a href="https://cnb.cool/tencent/hunyuan/Hunyuan-7B"><b>cnb.cool</b></a> |
26
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B/blob/main/LICENSE"><b>LICENSE</b></a>
27
+ </p>
28
+
29
+
30
+
31
+
32
+ ## 模型介绍
33
+
34
+ 混元是腾讯开源的高效大语言模型系列,专为多样化计算环境中的灵活部署而设计。从边缘设备到高并发生产系统,这些模型凭借先进的量化支持和超长上下文能力,在各种场景下都能提供最优性能。
35
+
36
+ 我们发布了一系列混元稠密模型,包括预训练和指令微调两种变体,参数规模涵盖0.5B、1.8B、4B和7B。这些模型采用了与混元-A13B相似的训练策略,因此继承了其强大的性能特征。这个全面的模型家族支持灵活的部署优化 - 从使用小尺寸的模型适配资源受限边缘计算场景,到使用较大尺寸的高性能模型支持高并发低延迟的复杂推理生产环境,在各种场景下都能保持强大的能力。
37
+
38
+
39
+ ### 核心特性与优势
40
+ - ​**混合推理支持**​:同时支持快思考和慢思考两种模式,支持用户灵活选择
41
+ - ​**超长上下文理解**​:原生支持256K上下文窗口,在长文本任务中保持稳定性能
42
+ - ​**增强Agent能力**​:优化Agent能力,在BFCL-v3、τ-Bench、C3-Bench等智能体基准测试中领先
43
+ - ​**高效推理**​:采用分组查询注意力(GQA)策略,支持多量化格式,实现高效推理
44
+
45
+ ## 新闻
46
+ <br>
47
+
48
+ * 2025.7.30 我们在Hugging Face开源了 **Hunyuan-0.5B-Pretrain** , **Hunyuan-1.8B-Pretrain** , **Hunyuan-4B-Pretrain** , **Hunyuan-7B-Pretrain** , **Hunyuan-0.5B-Instruct** , **Hunyuan-1.8B-Instruct** , **Hunyuan-4B-Instruct** , **Hunyuan-7B-Instruct**。
49
+
50
+ ## Benchmark评估榜单
51
+ | Model | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain|
52
+ |:------------------:|:---------------:|:--------------:|:-------------:|:---------------:|
53
+ | MMLU | 54.02 | 64.62 | 74.01 | 79.82 |
54
+ | MMLU-Redux | 54.72 | 64.42 | 73.53 | 79 |
55
+ | MMLU-Pro | 31.15 | 38.65 | 51.91 | 57.79 |
56
+ | SuperGPQA | 17.23 | 24.98 | 27.28 | 30.47 |
57
+ | BBH | 45.92 | 74.32 | 75.17 | 82.95 |
58
+ | GPQA | 27.76 | 35.81 | 43.52 | 44.07 |
59
+ | GSM8K | 55.64 | 77.26 | 87.49 | 88.25 |
60
+ | MATH | 42.95 | 62.85 | 72.25 | 74.85 |
61
+ | EvalPlus | 39.71 | 60.67 | 67.76 | 66.96 |
62
+ | MultiPL-E | 21.83 | 45.92 | 59.87 | 60.41 |
63
+ | MBPP | 43.38 | 66.14 | 76.46 | 76.19 |
64
+ | CRUX-O | 30.75 | 36.88 | 56.5 | 60.75 |
65
+ | Chinese SimpleQA | 12.51 | 22.31 | 30.53 | 38.86 |
66
+ | simpleQA (5shot) | 2.38 | 3.61 | 4.21 | 5.69 |
67
+
68
+
69
+ | Topic | Bench | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct|
70
+ |:-------------------:|:----------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
71
+ | **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 17.2<br>20<br>48.5 | 56.7<br>53.9<br>86 | 78.3<br>66.5<br>92.6 | 81.1<br>75.3<br>93.7 |
72
+ | **Science** | GPQA-Diamond<br>OlympiadBench | 23.3<br>29.6 | 47.2<br>63.4 | 61.1<br>73.1 | 60.1<br>76.5 |
73
+ | **Coding** | Livecodebench<br>Fullstackbench | 11.1<br>20.9 | 31.5<br>42 | 49.4<br>54.6 | 57<br>56.3 |
74
+ | **Reasoning** | BBH<br>DROP<br>ZebraLogic | 40.3<br>52.8<br>34.5 | 64.6<br>76.7<br>74.6 | 83<br>78.2<br>83.5 | 87.8<br>85.9<br>85.1 |
75
+ | **Instruction<br>Following** | IF-Eval<br>SysBench | 49.7<br>28.1 | 67.6<br>55.5 | 76.6<br>68 | 79.3<br>72.7 |
76
+ | **Agent** | BFCL v3<br> τ-Bench<br>ComplexFuncBench<br> C3-Bench | 49.8<br>14.4<br>13.9<br>45.3 | 58.3<br>18.2<br>22.3<br>54.6 | 67.9<br>30.1<br>26.3<br>64.3 | 70.8<br>35.3<br>29.2<br>68.5 |
77
+ | **Long<br>Context** | PenguinScrolls<br>longbench-v2<br>FRAMES | 53.9<br>34.7<br>41.9 | 73.1<br>33.2<br>55.6 | 83.1<br>44.1<br>79.2 | 82<br>43<br>78.6 |
78
+
79
+ &nbsp;
80
+
81
+ ## 使用 transformers 推理
82
+
83
+ 我们的模型默认使用慢思考进行推理,有两种方法可以禁用 CoT 推理。
84
+
85
+ 1. 调用 apply_chat_template 时传递 **enable_thinking=False**。
86
+ 2. 在 prompt 前添加 **/no_think** 将会强制模型不使用 CoT 推理。同理,在 prompt 前添加 **/think** 将会强制模型执行 CoT 推理。
87
+
88
+ 以下代码片段展示了如何使用 transformers 库加载和使用模型。它还演示了如何禁用推理模式,以及如何解析出“推理过程”和“最终输出”。
89
+
90
+ ```python
91
+ from transformers import AutoModelForCausalLM, AutoTokenizer
92
+ import os
93
+ import re
94
+
95
+ model_name_or_path = os.environ['MODEL_PATH']
96
+ # model_name_or_path = "tencent/Hunyuan-7B-Instruct"
97
+
98
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
99
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",trust_remote_code=True) # You may want to use bfloat16 and/or move to GPU here
100
+ messages = [
101
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
102
+ ]
103
+ text = tokenizer.apply_chat_template(
104
+ messages,
105
+ tokenize=False,
106
+ add_generation_prompt=True,
107
+ enable_thinking=True
108
+ )
109
+
110
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
111
+ model_inputs.pop("token_type_ids", None)
112
+ outputs = model.generate(**model_inputs, max_new_tokens=4096)
113
+ output_text = tokenizer.decode(outputs[0])
114
+
115
+ think_pattern = r'<think>(.*?)</think>'
116
+ think_matches = re.findall(think_pattern, output_text, re.DOTALL)
117
+
118
+ answer_pattern = r'<answer>(.*?)</answer>'
119
+ answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
120
+
121
+ think_content = [match.strip() for match in think_matches][0]
122
+ answer_content = [match.strip() for match in answer_matches][0]
123
+ print(f"thinking_content:{think_content}\n\n")
124
+ print(f"answer_content:{answer_content}\n\n")
125
+ ```
126
+
127
+
128
+ 我们推荐使用下面这组参数进行推理。注意,我们的模型没有默认 system_prompt。
129
+
130
+ ```json
131
+
132
+ {
133
+ "do_sample": true,
134
+ "top_k": 20,
135
+ "top_p": 0.8,
136
+ "repetition_penalty": 1.05,
137
+ "temperature": 0.7
138
+ }
139
+ ```
140
+
141
+ &nbsp;
142
+
143
+ ## 训练数据格式处理
144
+
145
+ 如果需要微调我们的 Instruct 模型,建议将数据处理成以下格式,分别对应慢思考和快思考的场景。
146
+
147
+ ```python
148
+ # think_pattern
149
+ think = ""
150
+ answer = ""
151
+ think_pattern = f"<think>\n{think}\n</think>\n<answer>\n{answer}\n</answer>"
152
+
153
+ # fast think pattern
154
+ messages = [
155
+ {"role": "system", "content": "You are a helpful assistant."},
156
+ {"role": "user", "content": "/no_think海水为什么是咸的" },
157
+ {"role": "assistant", "content": "<think>\n\n</think>\n<answer>\n海水是咸的主要是因为其中含有许多溶解在水中的盐类和矿物质。这些盐类和矿物质来自于地球表面的岩石和土壤中的化学物质,随着时间的推移,它们被带到了海洋中。当海水蒸发时,水分蒸发掉了,但盐类和矿物质仍然留在水中,导致海水变得更加咸味。因此,海水的咸度是由其中的盐类和矿物质的含量决定的。\n</answer>"}
158
+ ]
159
+
160
+ # slow think pattern
161
+ messages = [
162
+ {"role": "system", "content": "You are a helpful assistant."},
163
+ {"role": "user", "content": "1+1=" },
164
+ {"role": "assistant", "content": "<think>\n嗯,用户问的是1加1等于多少。首先,我需要确认这是一个基本的算术问题。1加1在十进制的数学体系中,通常的结果是2。不过,可能需要考虑是否有其他情况,比如二进制或者其他数制,但用户没有特别说明,所以默认应该是十进制。另外,有时候可能会有脑筋急转弯的情况,比如在某些语境下1+1可能等于1(比如1滴水加1滴水还是1滴水),但通常数学问题中都是2。所以最准确的回答应该是2。</think>\n<answer>\n在十进制的基本算术运算中,1加1的结果是2。这是数学中最基础的加法运算之一,遵循自然数的加法规则。因此,1 + 1 = 2。\n</answer>"}
165
+ ]
166
+
167
+ from transformers import AutoTokenizer
168
+ tokenizer = AutoTokenizer.from_pretrained("your_tokenizer_path", trust_remote_code=True)
169
+ train_ids = tokenizer.apply_chat_template(messages)
170
+ ```
171
+
172
+ &nbsp;
173
+
174
+ ## 使用 LLaMA-Factory 训练
175
+
176
+ 我们将介绍如何使用`LLaMA-Factory`来进行微调混元模型。
177
+
178
+ ### 安装环境
179
+
180
+ 开始之前,确保你已经安装了以下代码库:
181
+ 1. 使用[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)官方指导进行安装。
182
+ 2. 使用[DeepSpeed](https://github.com/deepspeedai/DeepSpeed#installation)官方指导进行安装(可选)。
183
+ 3. 安装配套的transformer库。当前混元提交的transformer代码正在评审中,需要获取配套的分支。
184
+ ```
185
+ pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
186
+ ```
187
+
188
+ ### 准备数据
189
+
190
+ 我们需要准备自定义的数据集:
191
+
192
+ 1. 请将您的数据以`json`格式进行组织,并将数据放入`LLaMA-Factory`的`data`目录中。当前使用的是`sharegpt`格式的数据集,需要遵循以下格式:
193
+ ```
194
+ [
195
+ {
196
+ "messages": [
197
+ {
198
+ "role": "system",
199
+ "content": "系统提示词(选填)"
200
+ },
201
+ {
202
+ "role": "user",
203
+ "content": "人类指令"
204
+ },
205
+ {
206
+ "role": "assistant",
207
+ "content": "模型回答"
208
+ }
209
+ ]
210
+ }
211
+ ]
212
+ ```
213
+ 可以参考前面章节中对[数据格式](#训练数据格式处理)的说明。
214
+
215
+ 2. 在`data/dataset_info.json`文件中提供您的数据集定义,并采用以下格式:
216
+ ```
217
+ "数据集名称": {
218
+ "file_name": "data.json",
219
+ "formatting": "sharegpt",
220
+ "columns": {
221
+ "messages": "messages"
222
+ },
223
+ "tags": {
224
+ "role_tag": "role",
225
+ "content_tag": "content",
226
+ "user_tag": "user",
227
+ "assistant_tag": "assistant",
228
+ "system_tag": "system"
229
+ }
230
+ }
231
+ ```
232
+
233
+ ### 训练
234
+
235
+ 1. 将`train/llama_factory_support/example_configs`目录下的文件都拷贝到`LLaMA-Factory`的`example/hunyuan`目录下。
236
+ 2. 修改配置文件`hunyuan_full.yaml`中的模型路径和数据集名称,其他的配置请根据需要进行修改。
237
+ ```
238
+ ### model
239
+ model_name_or_path: [!!!add the model path here!!!]
240
+
241
+ ### dataset
242
+ dataset: [!!!add the data set name here!!!]
243
+ ```
244
+ 3. 执行训练命令
245
+ * 运行单机训练
246
+ 请注意这里需要设置`DISABLE_VERSION_CHECK`环境变量,避免版本冲突。
247
+ ```
248
+ export DISABLE_VERSION_CHECK=1
249
+ llamafactory-cli train examples/hunyuan/hunyuan_full.yaml
250
+ ```
251
+ * 运行多机训练
252
+ 在每个节点上执行以下命令。请注意将`torchrun`需要的`NNODES`、`NODE_RANK`、`MASTER_ADDR`和`MASTER_PORT`按照您运行的环境进行配置。
253
+ ```
254
+ export DISABLE_VERSION_CHECK=1
255
+ FORCE_TORCHRUN=1 NNODES=${NNODES} NODE_RANK=${NODE_RANK} MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT} \
256
+ llamafactory-cli train examples/hunyuan_full.yaml
257
+ ```
258
+
259
+ &nbsp;
260
+
261
+ ## 量化压缩
262
+
263
+ 我们使用了 [AngelSlim](https://github.com/tencent/AngelSlim) 压缩工具来生成 FP8 和 INT4 量化模型。`AngelSlim` 是一款专门致力于打造更易用、更全面且更高效的模型压缩解决方案的工具。
264
+
265
+ ### FP8 量化
266
+ 我们采用FP8-static量化,FP8量化采用8位浮点格式,通过少量校准数据(无需训练)预先确定量化scale,将模型权重与激活值转换为FP8格式,提升推理效率并降低部署门槛。您可以使用AngelSlim量化,也可以直接下载我们量化完成的开源模型使用[LINK](https://huggingface.co/)。
267
+
268
+ ### Int4 Quantization
269
+ Int4量化我们采用GPTQ和AWQ算法实现W4A16量化。
270
+
271
+ GPTQ算法采用逐层处理模型权重,利用少量校准数据最小化量化后的权重重构误差,通过近似Hessian逆矩阵的优化过程逐层调整权重。流程无需重新训练模型,仅需少量校准数据即可量化权重,提升推理效率并降低部署门槛。
272
+ AWQ使用少量校准数据(无需进行训练)来计算激活值的幅度,从而进行统计计算。对于每个权重通道,都会计算一个缩放系数s,以扩大重要权重的数值表达范围,从而在量化过程中能够保留更多的信息。
273
+
274
+ 您可以使用 [AngleSlim](https://github.com/tencent/AngelSlim) 量化,也可以直接下载我们量化完成的开源模型使用 [LINK](https://huggingface.co/) 。
275
+
276
+
277
+ #### 量化 Benchmark
278
+ 本小节介绍了混元量化模型的基准指标。
279
+
280
+ | Bench | Quantization | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct |
281
+ |:-------------:|:---------------------------------:|:----------------------------:|:------------------------------:|:----------------------------:|:----------------------------:|
282
+ | DROP | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 52.8<br>51.6<br>50.9<br>48.9 | 76.7<br>75.1<br>73.0<br>71.7 | 78.2<br>78.3<br>78.1<br>78.2 | 85.9<br>86.0<br>85.7<br>85.9 |
283
+ | GPQA-Diamond | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 23.3<br>22.5<br>23.3<br>23.3 | 47.2<br>47.7<br>44.43<br>43.62 | 61.1<br>60.2<br>58.1<br>- | 60.1<br>60.1<br>60.0<br>60.1 |
284
+ | OlympiadBench | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 29.6<br>29.6<br>26.8<br>26.3 | 63.4<br>62.5<br>60.9<br>61.7 | 73.1<br>73.1<br>72.9<br>72.8 | 76.5<br>76.6<br>76.2<br>76.4 |
285
+ | AIME 2024 | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 17.2<br>17.2<br>-<br>- | 56.7<br>55.17<br>-<br>- | 78.3<br>76.6<br>-<br>- | 81.1<br>80.9<br>81.0<br>80.9 |
286
+
287
+
288
+
289
+ &nbsp;
290
+
291
+ ## 推理和部署
292
+
293
+ HunyuanLLM可以采用TensorRT-LLM, vLLM或sglang部署。为了简化部署过程,HunyuanLLM提供了预构建docker镜像,详见以下章节。
294
+
295
+ 镜像:https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags
296
+
297
+ ## 使用TensorRT-LLM推理
298
+ ### Docker:
299
+
300
+ 为了简化部署过程,HunyuanLLM提供了预构建docker镜像 (注意: 该镜像要求Host的Cuda版本为12.8以上):
301
+
302
+ [hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm](https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags) 。您只需要下载模型文件并用下面代码启动docker即可开始推理模型。
303
+ ```shell
304
+ # 拉取
305
+ 国内:
306
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm
307
+ 国外:
308
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
309
+
310
+ # 启动
311
+ docker run --privileged --user root --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
312
+ ```
313
+
314
+ 注: Docker容器权限管理。以上代码采用特权模式(--privileged)启动Docker容器会赋予容器较高的权限,增加数据泄露和集群安全风险。建议在非必要情况下避免使用特权模式,以降低安全威胁。对于必须使用特权模式的场景,应进行严格的安全评估,并实施相应的安全监控、加固措施。
315
+
316
+ ### BF16部署
317
+
318
+ #### Step1:执行推理
319
+
320
+ #### 方式1:命令行推理
321
+
322
+ 下面我们展示一个代码片段,采用`TensorRT-LLM`快速请求chat model:
323
+ 修改 examples/pytorch/quickstart_advanced.py 中如下代码:
324
+
325
+
326
+ ```python
327
+ def setup_llm(args):
328
+ kv_cache_config = KvCacheConfig(
329
+ enable_block_reuse=not args.disable_kv_cache_reuse,
330
+ free_gpu_memory_fraction=args.kv_cache_fraction,
331
+ )
332
+ spec_config = None
333
+
334
+ hf_ckpt_path="$your_hunyuan_model_path"
335
+ tokenizer = AutoTokenizer.from_pretrained(hf_ckpt_path, trust_remote_code=True)
336
+ llm = LLM(
337
+ tokenizer=tokenizer,
338
+ model=args.model_dir,
339
+ backend='pytorch',
340
+ disable_overlap_scheduler=args.disable_overlap_scheduler,
341
+ kv_cache_dtype=args.kv_cache_dtype,
342
+ kv_cache_config=kv_cache_config,
343
+ attn_backend=args.attention_backend,
344
+ use_cuda_graph=args.use_cuda_graph,
345
+ cuda_graph_padding_enabled=args.cuda_graph_padding_enabled,
346
+ cuda_graph_batch_sizes=args.cuda_graph_batch_sizes,
347
+ load_format=args.load_format,
348
+ print_iter_log=args.print_iter_log,
349
+ enable_iter_perf_stats=args.print_iter_log,
350
+ torch_compile_config=TorchCompileConfig(
351
+ enable_fullgraph=args.use_torch_compile,
352
+ enable_inductor=args.use_torch_compile,
353
+ enable_piecewise_cuda_graph= \
354
+ args.use_piecewise_cuda_graph)
355
+ if args.use_torch_compile else None,
356
+ moe_backend=args.moe_backend,
357
+ enable_trtllm_sampler=args.enable_trtllm_sampler,
358
+ max_seq_len=args.max_seq_len,
359
+ max_batch_size=args.max_batch_size,
360
+ max_num_tokens=args.max_num_tokens,
361
+ enable_attention_dp=args.enable_attention_dp,
362
+ tensor_parallel_size=args.tp_size,
363
+ pipeline_parallel_size=args.pp_size,
364
+ moe_expert_parallel_size=args.moe_ep_size,
365
+ moe_tensor_parallel_size=args.moe_tp_size,
366
+ moe_cluster_parallel_size=args.moe_cluster_size,
367
+ enable_chunked_prefill=args.enable_chunked_prefill,
368
+ speculative_config=spec_config,
369
+ trust_remote_code=args.trust_remote_code,
370
+ gather_generation_logits=args.return_generation_logits)
371
+
372
+ sampling_params = SamplingParams(
373
+ end_id=127960,
374
+ max_tokens=args.max_tokens,
375
+ temperature=args.temperature,
376
+ top_k=args.top_k,
377
+ top_p=args.top_p,
378
+ return_context_logits=args.return_context_logits,
379
+ return_generation_logits=args.return_generation_logits,
380
+ logprobs=args.logprobs)
381
+ return llm, sampling_params
382
+
383
+
384
+ def main():
385
+ args = parse_arguments()
386
+ prompts = args.prompt if args.prompt else example_prompts
387
+
388
+ llm, sampling_params = setup_llm(args)
389
+ new_prompts = []
390
+ for prompt in prompts:
391
+ messages = [{"role": "user", "content": f"{prompt}"}]
392
+ new_prompts.append(
393
+ llm.tokenizer.apply_chat_template(messages,
394
+ tokenize=False,
395
+ add_generation_prompt=True))
396
+ prompts = new_prompts
397
+ outputs = llm.generate(prompts, sampling_params)
398
+
399
+ for i, output in enumerate(outputs):
400
+ prompt = output.prompt
401
+ generated_text = output.outputs[0].text
402
+ print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
403
+ ```
404
+
405
+ 运行方式:
406
+
407
+ ```shell
408
+ python3 quickstart_advanced.py --model_dir "HunyuanLLM模型路径" --tp_size 4
409
+ ```
410
+
411
+ #### 方式2:服务化推理
412
+
413
+ 下面我们展示使用`TensorRT-LLM`服务化的方式部署模型和请求。
414
+
415
+ 准备配置文件:
416
+
417
+ ```
418
+ cat >/path/to/extra-llm-api-config.yml <<EOF
419
+ use_cuda_graph: true
420
+ cuda_graph_padding_enabled: true
421
+ cuda_graph_batch_sizes:
422
+ - 1
423
+ - 2
424
+ - 4
425
+ - 8
426
+ - 16
427
+ - 32
428
+ print_iter_log: true
429
+ EOF
430
+ ```
431
+
432
+ 启动服务:
433
+
434
+ ```shell
435
+ trtllm-serve \
436
+ /path/to/HunYuan-moe-A13B \
437
+ --host localhost \
438
+ --port 8000 \
439
+ --backend pytorch \
440
+ --max_batch_size 32 \
441
+ --max_num_tokens 16384 \
442
+ --tp_size 2 \
443
+ --kv_cache_free_gpu_memory_fraction 0.6 \
444
+ --trust_remote_code \
445
+ --extra_llm_api_options /path/to/extra-llm-api-config.yml
446
+ ```
447
+
448
+ 服务启动成功后, 使用 OpenAI API 进行模型推理调用:
449
+ ```
450
+ curl -X POST "http://localhost:8000/v1/chat/completions" \
451
+ -H "Content-Type: application/json" \
452
+ --data '{
453
+ "model": "HunYuan/HunYuan-80B-A13B",
454
+ "messages": [
455
+ {
456
+ "role": "user",
457
+ "content": "Write a short summary of the benefits of regular exercise"
458
+ }
459
+ ]
460
+ }'
461
+ ```
462
+
463
+ #### FP8/Int4量化模型部署:
464
+ 目前 TensorRT-LLM 的 fp8 和 int4 量化模型正在支持中,敬请期待。
465
+
466
+
467
+ ## 使用vLLM推理
468
+ ### Docker:
469
+
470
+ 为了简化部署过程,HunyuanLLM提供了预构建docker镜像 (注意: 该镜像要求Host的Cuda版本为12.8以上):
471
+
472
+ [hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm](https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags) 。您只需要下载模型文件并用下面代码启动docker即可开始推理模型。
473
+ ```shell
474
+ # 下载模型:
475
+ # ModelScope:
476
+ modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct
477
+ # Huggingface: vllm 会自动下载
478
+
479
+ # 拉取
480
+ 国内:
481
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm
482
+ 国外:
483
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
484
+
485
+ # 使用 huggingface 起服务
486
+ docker run --privileged --user root --net=host --ipc=host \
487
+ -v ~/.cache:/root/.cache/ \
488
+ --gpus=all -it --entrypoint python docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm \
489
+ -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
490
+ --tensor-parallel-size 4 --model tencent/Hunyuan-A13B-Instruct --trust-remote-code
491
+
492
+ # 使用modelscope下载的模型起服务
493
+ docker run --privileged --user root --net=host --ipc=host \
494
+ -v ~/.cache/modelscope:/root/.cache/modelscope \
495
+ --gpus=all -it --entrypoint python docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm \
496
+ -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --tensor-parallel-size 4 \
497
+ --port 8000 --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ --trust_remote_code
498
+ ```
499
+
500
+ 注: Docker容器权限管理。以上代码采用特权模式(--privileged)启动Docker容器会赋予容器较高的权限,增加数据泄露和集群安全风险。建议在非必要情况下避免使用特权模式,以降低安全威胁。对于必须使用特权模式的场景,应进行严格的安全评估,并实施相应的安全监控、加固措施。
501
+
502
+
503
+ ### BF16部署
504
+
505
+ BF16可以在2张显存超过80G的GPU卡上部署,如果长文推荐TP4。按如下步骤执行:
506
+
507
+ 运行命令前请先设置如下环境变量:
508
+
509
+ ```shell
510
+ export MODEL_PATH=PATH_TO_MODEL
511
+ ```
512
+
513
+ #### Step1:执行推理
514
+
515
+ #### 方式1:命令行推理
516
+
517
+ 下面我们展示一个代码片段,采用`vLLM`快速请求chat model:
518
+
519
+ 注: vLLM组件远程代码执行防护。下列代码中vLLM组件的trust-remote-code配置项若被启用,将允许加载并执行来自远程模型仓库的代码,这可能导致恶意代码的执行。除非业务需求明确要求,否则建议该配置项处于禁用状态,以降低潜在的安全威胁。
520
+
521
+
522
+ ```python
523
+ import os
524
+ from typing import List, Optional
525
+ from vllm import LLM, SamplingParams
526
+ from vllm.inputs import PromptType
527
+ from transformers import AutoTokenizer
528
+
529
+ model_path=os.environ.get('MODEL_PATH')
530
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
531
+
532
+ llm = LLM(model=model_path,
533
+ tokenizer=model_path,
534
+ trust_remote_code=True,
535
+ dtype='bfloat16',
536
+ tensor_parallel_size=4,
537
+ gpu_memory_utilization=0.9)
538
+
539
+ sampling_params = SamplingParams(
540
+ temperature=0.7, top_p=0.8, max_tokens=4096, top_k=20, repetition_penalty=1.05)
541
+
542
+ messages = [
543
+ {
544
+ "role": "system",
545
+ "content": "You are a helpful assistant.",
546
+ },
547
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
548
+ ]
549
+
550
+ tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
551
+
552
+ dummy_inputs: List[PromptType] = [{
553
+ "prompt_token_ids": batch
554
+ } for batch in tokenized_chat.numpy().tolist()]
555
+
556
+ outputs = llm.generate(dummy_inputs, sampling_params)
557
+
558
+ # Print the outputs.
559
+ for output in outputs:
560
+ prompt = output.prompt
561
+ generated_text = output.outputs[0].text
562
+ print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
563
+ ```
564
+
565
+ #### 方式2:服务化推理
566
+
567
+ 下面我们展示使用`vLLM`服务化的方式部署模型并请求
568
+
569
+ 在主节点上运行:
570
+
571
+ ```shell
572
+ export VLLM_HOST_IP=${LOCAL_IP}
573
+ ```
574
+ 接着我们启动服务,运行 :
575
+ ```shell
576
+ cd inference
577
+ sh run_server.sh
578
+ ```
579
+
580
+ 运行`run_server.sh`成功后, 运行请求脚本:
581
+ ```shell
582
+ sh openapi.sh
583
+ ```
584
+
585
+ 注意修改`openapi.sh`中的`${LOCAL_IP}`和`${MODEL_PATH}`为服务对应值。
586
+
587
+
588
+ ### 量化模型部署:
589
+
590
+ 本部分介绍采用vLLM部署量化后模型的流程。
591
+
592
+ 镜像:部署镜像同BF16。
593
+
594
+
595
+ #### Int8量化模型部署:
596
+ 部署Int8-weight-only版本HunYuan-A13B模型只需设置`run_server_int8.sh`中的环境变量:
597
+ ```SHELL
598
+ export MODEL_PATH=PATH_TO_BF16_MODEL
599
+ ```
600
+
601
+ 接着我们启动Int8服务。运行:
602
+ ```shell
603
+ sh run_server_int8.sh
604
+ ```
605
+
606
+ 运行`run_server_int8.sh`成功后, 运行请求脚本:
607
+ ```shell
608
+ sh openapi.sh
609
+ ```
610
+
611
+ #### Int4量化模型部署:
612
+ 部署Int4-weight-only版本HunYuan-A13B模型只需设置`run_server_int4.sh`中的环境变量,采用GPTQ方式:
613
+ ```SHELL
614
+ export MODEL_PATH=PATH_TO_INT4_MODEL
615
+ ```
616
+
617
+ 接着我们启动Int4服务。运行:
618
+ ```shell
619
+ sh run_server_int4.sh
620
+ ```
621
+
622
+ 运行`run_server_int4.sh`成功后, 运行请求脚本:
623
+ ```shell
624
+ sh openapi.sh
625
+ ```
626
+
627
+ #### FP8量化模型部署:
628
+ 部署W8A8C8版本HunYuan-A13B模型只需设置`run_server_int8.sh`中的环境变量:
629
+ ```shell
630
+ export MODEL_PATH=PATH_TO_FP8_MODEL
631
+ ```
632
+
633
+ 接着我们启动FP8服务。运行:
634
+ ```shell
635
+ sh run_server_fp8.sh
636
+ ```
637
+
638
+ 运行`run_server_fp8.sh`成功后, 运行请求脚本:
639
+ ```shell
640
+ sh openapi.sh
641
+ ```
642
+ ## 使用sglang推理
643
+
644
+ ### BF16部署
645
+
646
+ #### Step1: 拉取镜像
647
+
648
+
649
+ ```
650
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
651
+
652
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
653
+ ```
654
+
655
+ - 启动 API server:
656
+
657
+ ```
658
+ docker run --gpus all \
659
+ --shm-size 32g \
660
+ -p 30000:30000 \
661
+ --ipc=host \
662
+ docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
663
+ -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
664
+ ```
665
+
666
+ #### Step2:执行推理
667
+
668
+ #### 方式1:命令行推理
669
+
670
+ 下面我们展示一个代码片段,采用`sglang`快速请求chat model:
671
+
672
+
673
+ ```python
674
+ import sglang as sgl
675
+ from transformers import AutoTokenizer
676
+
677
+ model_path=os.environ.get('MODEL_PATH')
678
+
679
+
680
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
681
+
682
+ messages = [
683
+ {
684
+ "role": "system",
685
+ "content": "You are a helpful assistant.",
686
+ },
687
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
688
+ ]
689
+ prompts = []
690
+ prompts.append(tokenizer.apply_chat_template(
691
+ messages,
692
+ tokenize=False,
693
+ add_generation_prompt=True
694
+ ))
695
+ print(prompts)
696
+
697
+ llm = sgl.Engine(
698
+ model_path=model_path,
699
+ tp_size=4,
700
+ trust_remote_code=True,
701
+ mem_fraction_static=0.7,
702
+ )
703
+
704
+ sampling_params = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_new_tokens": 4096}
705
+ outputs = llm.generate(prompts, sampling_params)
706
+ for prompt, output in zip(prompts, outputs):
707
+ print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
708
+ ```
709
+
710
+ #### 方式2:服务化推理
711
+
712
+ 下面我们展示使用`sglang`服务化的方式部署模型和请求。
713
+
714
+ ```shell
715
+ model_path="HunyuanLLM模型路径"
716
+ python3 -u -m sglang.launch_server \
717
+ --model-path $model_path \
718
+ --tp 4 \
719
+ --trust-remote-code
720
+ ```
721
+
722
+ 服务启动成功后, 运行请求脚本:
723
+ ```python
724
+ import openai
725
+ client = openai.Client(
726
+ base_url="http://localhost:30000/v1", api_key="EMPTY")
727
+
728
+ response = client.chat.completions.create(
729
+ model="default",
730
+ messages= [
731
+ {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
732
+ ],
733
+ temperature=0.7,
734
+ max_tokens=4096,
735
+ extra_body={"top_p": 0.8, "top_k": 20}
736
+ )
737
+ print(response)
738
+ ```
739
+
740
+ #### FP8/Int4量化模型部署:
741
+ 目前 sglang 的 fp8 和 int4 量化模型正在支持中,敬请期待。
742
+
743
+ ## 交互式Demo Web
744
+ hunyuan-A13B 现已开放网页demo。访问 https://hunyuan.tencent.com/?model=hunyuan-a13b 即可简单体验我们的模型。
745
+
746
+
747
+ ## 联系我们
748
+ 如果你想给我们的研发和产品团队留言,欢迎联系我们腾讯混元LLM团队。你可以通过邮件([email protected])联系我们。
README.md ADDED
@@ -0,0 +1,17 @@
1
+
2
+ ---
3
+ license: mit
4
+ language:
5
+ - en
6
+ pipeline_tag: text-generation
7
+ ---
8
+
9
+ My own (ZeroWw) quantizations.
10
+ Output and embedding tensors are quantized to f16.
11
+ All other tensors are quantized to q5_k or q6_k.
12
+
13
+ Result:
14
+ Both f16.q6 and f16.q5 are smaller than the standard q8_0 quantization,
15
+ and they perform as well as the pure f16.
16
+
17
+ Updated on: Mon Aug 04, 13:19:36
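
A minimal loading sketch for these files, assuming `llama-cpp-python` is installed and the q6_k file has been downloaded locally:

```python
from llama_cpp import Llama

# Load the f16 output/embedding + q6_k quantization produced above.
llm = Llama(model_path="Hunyuan-7B-Instruct.q6_k.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
    temperature=0.7,
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```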