woodchen7 committed · Commit b958250 · verified · 1 Parent(s): 91254f9

Upload README.md with huggingface_hub
Files changed (1): README.md (+306 −0)

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo_light.png?raw=true">
    <img alt="AngelSlim" src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo.png?raw=true" width=55%>
  </picture>
</p>

<h3 align="center">
Dedicated to building a more intuitive, comprehensive, and efficient LLM compression toolkit.
</h3>

<p align="center">
📖 <a href="https://angelslim.readthedocs.io/">Documentation</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/AngelSlim">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://modelscope.cn/organization/AngelSlim">ModelScope</a>&nbsp;&nbsp;|&nbsp;&nbsp;💬 <a href="./docs/source/assets/angel_slim_wechat.png">WeChat</a>
<br>
</p>

## Table of Contents

- [Latest Updates](#latest-updates)
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [How to Use](#how-to-use)
  - [Install AngelSlim](#install-angelslim)
  - [Quick Start](#quick-start)
  - [Deployment & Testing](#deployment)
- [Benchmark](#benchmark)
- [License](#license)
- [Citation](#citation)
- [Technical Discussion](#technical-discussion)

## 📣 Latest Updates

- [25/07/04] We now support quantization for Hunyuan, Qwen2.5, Qwen3, and DeepSeek-R1-Distill-Qwen models, among others, covering the INT8, FP8, and INT4 algorithms.
  We have also open-sourced the Eagle3 model weights for Qwen3-8B.

Coming soon:

- [ ] Support W4A8 quantization for DeepSeek-R1.
- [ ] Support quantization for multimodal models such as Qwen-VL.
- [ ] Release of a new speculative sampling algorithm.

## 🌟 Key Features

- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely used industry algorithms, we are continuously researching better compression algorithms, which will be open-sourced over time.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, for example enabling quantization of models such as Qwen3-235B and DeepSeek-R1 on a single GPU.

## 💼 Supported Models

### Quantization
We currently support the following LLMs, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, the DeepSeek-R1 distilled Qwen models, and QwQ:

| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| --------------------------------------------------------------------------------------------------------------------------- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/tencent/Hunyuan-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Speculative Decoding
The Eagle3 weights for the Qwen3-8B model are now available, with Eagle3 weights for other models in the Qwen3 series to be released soon.

| Model | Eagle3 |
| ----------| ----------------- |
| [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3/tree/main) | ✅ |
| Qwen3-14B | coming soon |
| Qwen3-32B | coming soon |

## 🛎️ How to Use

### Install AngelSlim

We recommend using `pip` to install the latest stable version of `AngelSlim`:

```shell
pip install angelslim
```

Alternatively, you can clone the repository and install from source:

```shell
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install
```

For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

### Quick Start

After installing `AngelSlim`, you can get started quickly by running the following script, which performs static `FP8` quantization on the `Qwen3-1.7B` model:

* One-click Start

```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```

This example loads the Hugging Face model, calibrates activations on the `dataset` specified in the config file, and saves the quantized model weights.

* Code-based Start

To perform dynamic `FP8` quantization on `Qwen3-1.7B`:

```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare the model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize the compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress the model
slim_engine.run()
# Save the compressed model
slim_engine.save("./output")
```

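The saved directory can then be used as the `MODEL_PATH` argument for the deployment scripts described below, for example:

```shell
# Serve the freshly quantized checkpoint via the vLLM launch script
# from the Deployment section below.
MODEL_PATH=./output
bash deploy/run_vllm.sh $MODEL_PATH
```
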
For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

### 🖥️ Deployment and Testing

#### 1. API Service Deployment

After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:

**vLLM**

Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; `vllm>=0.8.5.post1` is recommended, and MoE INT8 quantized models require `vllm>=0.9.0`.

```shell
bash deploy/run_vllm.sh $MODEL_PATH
```

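At its core, such a script starts vLLM's OpenAI-compatible server on the quantized checkpoint. A minimal sketch is shown below; the flags are illustrative defaults, not necessarily what `deploy/run_vllm.sh` actually sets:

```shell
# Minimal vLLM OpenAI-compatible server; flags are illustrative --
# see deploy/run_vllm.sh for the settings actually used.
python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL_PATH" \
  --tensor-parallel-size 1 \
  --port 8000
```
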
**SGLang**

Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; `sglang>=0.4.6.post1` is recommended.

```shell
bash deploy/run_sglang.sh $MODEL_PATH
```

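The SGLang equivalent is similarly thin; a minimal sketch, again with illustrative flags (see `deploy/run_sglang.sh` for the real settings):

```shell
# Minimal SGLang server launch; flags are illustrative.
python -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --port 30000
```
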
#### 2. Service Invocation

Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):

```shell
bash deploy/openai.sh $MODEL_PATH
```

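Equivalently, you can hit the endpoint directly with `curl`. The host and port below are assumptions (vLLM defaults to port 8000); adjust them to match your launch script:

```shell
# Direct chat-completions request in OpenAI API format.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"$MODEL_PATH"'",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```
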
#### 3. Performance Evaluation

Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); `lm-eval>=0.4.8` is recommended:

```shell
bash deploy/lm_eval.sh $MODEL_PATH
```

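If you prefer to call the harness directly, a minimal equivalent run looks like this; the task list and batch size are illustrative rather than what `deploy/lm_eval.sh` necessarily uses:

```shell
# Direct lm-evaluation-harness run; tasks and batch size are illustrative.
lm_eval --model hf \
  --model_args pretrained="$MODEL_PATH" \
  --tasks gsm8k,mmlu \
  --batch_size 8
```
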
For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).

## 📈 Benchmark

### Quantization

The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark Documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).

#### Hunyuan Series Models

Benchmark results for the `Hunyuan-A13B-Instruct` model with the `FP8` and `INT4-GPTQ` quantization algorithms on datasets including `AIME 2024`, `GSM8K`, `BBH`, and `DROP`:

| Bench | Hunyuan-A13B-Instruct | Hunyuan-A13B-Instruct-FP8 | Hunyuan-A13B-Instruct-Int4-GPTQ |
|:---------:|:---------------------:|:-------------------------:|:-------------------------------:|
| AIME 2024 | 87.3 | 86.7 | 86.7 |
| GSM8K | 94.39 | 94.01 | 94.24 |
| BBH | 89.1 | 88.34 | 87.91 |
| DROP | 91.1 | 91.1 | 91.05 |

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with the `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th><th>HUMANEVAL</th></tr>
</thead>
<tbody>
<tr><td rowspan="4">Qwen3-0.6B</td><td>BF16</td><td>45.84</td><td>47.21</td><td>42.99</td><td>19.51</td></tr>
<tr><td>FP8-Static</td><td>45.99</td><td>46.87</td><td>38.06</td><td>18.90</td></tr>
<tr><td>FP8-Dynamic</td><td>45.99</td><td>46.93</td><td>38.29</td><td>20.73</td></tr>
<tr><td>INT8-Dynamic</td><td>45.17</td><td>46.95</td><td>41.17</td><td>21.34</td></tr>
<tr><td rowspan="6">Qwen3-8B</td><td>BF16</td><td>79.27</td><td>74.78</td><td>87.79</td><td>63.41</td></tr>
<tr><td>FP8-Static</td><td>78.23</td><td>74.79</td><td>86.96</td><td>62.20</td></tr>
<tr><td>FP8-Dynamic</td><td>78.45</td><td>74.75</td><td>87.64</td><td>62.80</td></tr>
<tr><td>INT8-Dynamic</td><td>78.01</td><td>74.84</td><td>86.96</td><td>67.07</td></tr>
<tr><td>INT4-GPTQ</td><td>77.19</td><td>73.26</td><td>86.43</td><td>62.20</td></tr>
<tr><td>INT4-AWQ</td><td>76.15</td><td>73.59</td><td>86.96</td><td>63.41</td></tr>
<tr><td rowspan="6">Qwen3-14B</td><td>BF16</td><td>83.06</td><td>78.90</td><td>88.40</td><td>55.49</td></tr>
<tr><td>FP8-Static</td><td>82.62</td><td>78.57</td><td>89.46</td><td>57.32</td></tr>
<tr><td>FP8-Dynamic</td><td>82.24</td><td>78.92</td><td>88.32</td><td>52.44</td></tr>
<tr><td>INT8-Dynamic</td><td>81.87</td><td>78.13</td><td>86.28</td><td>56.10</td></tr>
<tr><td>INT4-GPTQ</td><td>81.05</td><td>78.02</td><td>87.34</td><td>57.93</td></tr>
<tr><td>INT4-AWQ</td><td>82.02</td><td>77.68</td><td>84.23</td><td>61.59</td></tr>
<tr><td rowspan="5">Qwen3-32B</td><td>BF16</td><td>86.55</td><td>82.00</td><td>74.53</td><td>37.80</td></tr>
<tr><td>FP8-Static</td><td>86.92</td><td>81.78</td><td>70.20</td><td>39.63</td></tr>
<tr><td>FP8-Dynamic</td><td>86.55</td><td>81.89</td><td>70.43</td><td>38.41</td></tr>
<tr><td>INT4-GPTQ</td><td>86.18</td><td>81.01</td><td>-</td><td>43.29</td></tr>
<tr><td>INT4-AWQ</td><td>86.18</td><td>81.54</td><td>-</td><td>36.59</td></tr>
<tr><td rowspan="4">Qwen3-30B-A3B</td><td>BF16</td><td>83.66</td><td>79.36</td><td>89.99</td><td>31.71</td></tr>
<tr><td>FP8-Static</td><td>83.95</td><td>79.47</td><td>89.01</td><td>31.10</td></tr>
<tr><td>FP8-Dynamic</td><td>84.10</td><td>79.40</td><td>89.16</td><td>32.93</td></tr>
<tr><td>INT8-Dynamic</td><td>83.36</td><td>79.48</td><td>89.16</td><td>34.15</td></tr>
<tr><td rowspan="4">Qwen3-235B-A22B</td><td>BF16</td><td>89.60</td><td>86.28</td><td>85.29</td><td>27.44</td></tr>
<tr><td>FP8-Static</td><td>89.67</td><td>86.19</td><td>86.96</td><td>27.44</td></tr>
<tr><td>FP8-Dynamic</td><td>89.67</td><td>86.18</td><td>85.22</td><td>28.05</td></tr>
<tr><td>INT8-Dynamic</td><td>88.93</td><td>86.20</td><td>86.20</td><td>23.78</td></tr>
<tr><td rowspan="5">QwQ-32B</td><td>BF16</td><td>85.74</td><td>82.03</td><td>73.31</td><td>42.68</td></tr>
<tr><td>FP8-Static</td><td>85.44</td><td>81.91</td><td>75.36</td><td>42.68</td></tr>
<tr><td>FP8-Dynamic</td><td>85.07</td><td>81.93</td><td>75.66</td><td>42.07</td></tr>
<tr><td>INT4-GPTQ</td><td>84.03</td><td>81.26</td><td>68.23</td><td>45.73</td></tr>
<tr><td>INT4-AWQ</td><td>83.58</td><td>81.01</td><td>68.69</td><td>43.29</td></tr>
</tbody>
</table>

#### Other Models

Benchmark results for other models with the `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, and `GSM8K`:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th></tr>
</thead>
<tbody>
<tr><td rowspan="3">Qwen2.5-1.5B-Instruct</td><td>BF16</td><td>67.01</td><td>60.05</td><td>54.28</td></tr>
<tr><td>FP8-Static</td><td>66.27</td><td>60.23</td><td>-</td></tr>
<tr><td>FP8-Dynamic</td><td>66.79</td><td>60.08</td><td>51.71</td></tr>
<tr><td rowspan="5">Qwen2.5-7B-Instruct</td><td>BF16</td><td>81.20</td><td>74.55</td><td>79.98</td></tr>
<tr><td>FP8-Static</td><td>81.13</td><td>74.03</td><td>79.30</td></tr>
<tr><td>FP8-Dynamic</td><td>80.31</td><td>74.07</td><td>79.00</td></tr>
<tr><td>INT4-GPTQ</td><td>79.05</td><td>73.05</td><td>74.75</td></tr>
<tr><td>INT4-AWQ</td><td>79.35</td><td>73.22</td><td>79.38</td></tr>
<tr><td rowspan="5">Qwen2.5-32B-Instruct</td><td>BF16</td><td>87.30</td><td>83.21</td><td>81.73</td></tr>
<tr><td>FP8-Static</td><td>87.59</td><td>83.08</td><td>81.58</td></tr>
<tr><td>FP8-Dynamic</td><td>87.30</td><td>83.04</td><td>81.58</td></tr>
<tr><td>INT4-GPTQ</td><td>86.70</td><td>82.45</td><td>82.03</td></tr>
<tr><td>INT4-AWQ</td><td>87.00</td><td>82.64</td><td>-</td></tr>
<tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-7B</td><td>BF16</td><td>53.49</td><td>53.80</td><td>75.74</td></tr>
<tr><td>FP8-Static</td><td>53.57</td><td>54.17</td><td>76.19</td></tr>
<tr><td>FP8-Dynamic</td><td>52.97</td><td>54.13</td><td>74.15</td></tr>
<tr><td>INT4-GPTQ</td><td>51.86</td><td>52.44</td><td>75.89</td></tr>
<tr><td>INT4-AWQ</td><td>53.49</td><td>53.70</td><td>-</td></tr>
<tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-14B</td><td>BF16</td><td>77.71</td><td>74.28</td><td>85.67</td></tr>
<tr><td>FP8-Static</td><td>77.56</td><td>74.66</td><td>86.73</td></tr>
<tr><td>FP8-Dynamic</td><td>76.82</td><td>74.63</td><td>87.11</td></tr>
<tr><td>INT4-GPTQ</td><td>74.29</td><td>72.37</td><td>84.61</td></tr>
<tr><td>INT4-AWQ</td><td>74.81</td><td>73.00</td><td>86.05</td></tr>
<tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-32B</td><td>BF16</td><td>84.18</td><td>80.89</td><td>87.41</td></tr>
<tr><td>FP8-Static</td><td>83.43</td><td>80.90</td><td>87.57</td></tr>
<tr><td>FP8-Dynamic</td><td>83.73</td><td>81.10</td><td>86.43</td></tr>
<tr><td>INT4-GPTQ</td><td>84.10</td><td>79.80</td><td>86.73</td></tr>
<tr><td>INT4-AWQ</td><td>82.84</td><td>80.15</td><td>87.19</td></tr>
</tbody>
</table>

### Speculative Decoding
Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:

#### Qwen3-8B

<table border="0">
<thead>
<tr><th rowspan="3">Temperature</th><th rowspan="3">Method</th><th colspan="8">Datasets</th></tr>
<tr><th colspan="2">MT-bench</th><th colspan="2">HumanEval</th><th colspan="2">GSM8K</th><th colspan="2">Alpaca</th></tr>
<tr><th>Speedup</th><th>Accept length</th><th>Speedup</th><th>Accept length</th><th>Speedup</th><th>Accept length</th><th>Speedup</th><th>Accept length</th></tr>
</thead>
<tbody>
<tr><td>T=0</td><td>Eagle3</td><td>2.63x</td><td>3.65</td><td>2.76x</td><td>3.85</td><td>2.82x</td><td>3.90</td><td>2.62x</td><td>3.48</td></tr>
<tr><td>T=1</td><td>Eagle3</td><td>1.98x</td><td>2.75</td><td>2.25x</td><td>3.11</td><td>2.31x</td><td>3.15</td><td>2.10x</td><td>2.76</td></tr>
</tbody>
</table>

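To try these speedups at serving time, recent vLLM releases can load Eagle3 draft weights through their speculative-decoding configuration. The sketch below is an assumption about that interface rather than an AngelSlim script; the `--speculative-config` JSON schema varies across vLLM versions, so check the vLLM documentation for your installed version:

```shell
# Illustrative only: serve Qwen3-8B with the AngelSlim Eagle3 draft model.
# The speculative-decoding flags differ between vLLM releases.
vllm serve Qwen/Qwen3-8B \
  --speculative-config '{"method": "eagle3", "model": "AngelSlim/Qwen3-8B_eagle3", "num_speculative_tokens": 3}'
```
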
## 📝 Model License

The code for this project is open-sourced under the [License for AngelSlim](License_AngelSlim_model_and_dataset.txt).

## 🔗 Citation

```
@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={6},
    url={https://github.com/Tencent/AngelSlim},
}
```

## 💬 Technical Discussion

* AngelSlim is continuously iterating, and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub or join our [WeChat technical discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).