lbourdois committed on
Commit 03cd764 · verified · 1 Parent(s): b1f6249

Improve language tag


Hi! As the model is multilingual, this is a PR to add languages other than English to the language tag, to improve how the model is referenced. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1):
README.md +228 -216
README.md CHANGED
The only change is the `language` list in the YAML front matter:

```diff
@@ -1,216 +1,228 @@
 ---
 license: apache-2.0
 license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE
 language:
-- en
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
 pipeline_tag: text-generation
 base_model: Qwen/Qwen2.5-7B-Instruct
 tags:
 - chat
 - neuralmagic
 - llmcompressor
 - fp8
 ---
```

The rest of the file is unchanged; the updated README follows.

# Qwen2.5-7B-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Qwen2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 11/27/2024
- **Version:** 1.0
- **License(s):** [apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme; see the sketch below.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
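A minimal sketch of these two schemes, assuming PyTorch 2.1+ for the `torch.float8_e4m3fn` dtype (illustrative only, not the llm-compressor implementation):

```python
# Illustrative sketch of the two FP8 quantization schemes described above.
# Assumes PyTorch >= 2.1 (torch.float8_e4m3fn); not the llm-compressor internals.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for the E4M3 format

def quantize_weight_per_channel(w: torch.Tensor):
    # Static per-channel: one scale per output channel, fixed at compression time.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic per-token: one scale per token (row), recomputed at inference time.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale
```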

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Render the chat template to a prompt string; add_generation_prompt appends the
# assistant header so the model knows to respond.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
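For example, after starting a local server with `vllm serve RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic`, the endpoint can be queried with the official OpenAI client (a minimal sketch; host, port, and sampling values are illustrative):

```python
# Query a local vLLM OpenAI-compatible server (default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # no real key needed locally
completion = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-7B-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```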

## Creation

<details>
  <summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load the original (unquantized) model
model_stub = "Qwen/Qwen2.5-7B-Instruct"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme:
# FP8 weights (static, per-channel) and FP8 activations (dynamic, per-token),
# applied to all Linear layers except the output head.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>
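Once saved, the checkpoint can be loaded straight back into vLLM from the local directory (a quick sanity check; assumes the script above has been run in the current working directory):

```python
# Load the locally saved compressed-tensors checkpoint and run a short generation.
from vllm import LLM

llm = LLM(model="Qwen2.5-7B-Instruct-FP8-dynamic")  # the save_path written above
print(llm.generate("Hello!")[0].outputs[0].text)
```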

## Evaluation

The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/387Bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 387Bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-FP8-dynamic<br>(this model) | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 74.24 | 74.04 | 99.7% |
| ARC Challenge (25-shot) | 63.40 | 63.14 | 99.6% |
| GSM-8K (5-shot, strict-match) | 80.36 | 80.06 | 99.6% |
| Hellaswag (10-shot) | 81.52 | 81.11 | 99.5% |
| Winogrande (5-shot) | 74.66 | 74.43 | 99.7% |
| TruthfulQA (0-shot, mc2) | 64.76 | 64.87 | 100.2% |
| **Average** | **73.16** | **72.94** | **99.7%** |
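Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score. For MMLU, for example:

```python
# Recovery for MMLU: quantized score relative to the BF16 baseline.
recovery = 74.04 / 74.24 * 100  # ~99.73%, reported as 99.7%
```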