nm-research committed
Commit 3ccfb74 · verified · 1 Parent(s): 4ba3f1e

Update README.md

Files changed (1)
  1. README.md +152 -115
README.md CHANGED
@@ -78,6 +78,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
78
 
79
 This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py), as presented in the code snippet below.
80
 
81
  ```python
82
  from datasets import load_dataset
83
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -158,10 +160,11 @@ model.save_pretrained(SAVE_DIR, save_compressed=True)
158
  tokenizer.save_pretrained(SAVE_DIR)
159
 
160
  ```
 
161
 
162
  ## Evaluation
163
 
164
- This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
165
 
166
  ### Accuracy
167
 
@@ -176,114 +179,126 @@ This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, an
176
  </tr>
177
  </thead>
178
  <tbody>
 
179
  <tr>
180
  <td rowspan="7"><b>OpenLLM V1</b></td>
181
  <td>mmlu</td>
182
- <td></td>
183
- <td></td>
184
- <td></td>
185
  </tr>
186
- <tr>
187
- <td>MMLU</td>
188
- <td></td>
189
- <td></td>
190
- <td></td>
191
- </tr>
192
- <tr>
193
- <td>ARC Challenge (0-shot)</td>
194
- <td></td>
195
- <td></td>
196
- <td></td>
197
- </tr>
198
- <tr>
199
- <td>GSM8K (8-shot, strict-match)</td>
200
- <td></td>
201
- <td></td>
202
- <td></td>
203
- </tr>
204
- <tr>
205
- <td>Hellaswag (10-shot)</td>
206
- <td></td>
207
- <td></td>
208
- <td></td>
209
- </tr>
210
- <tr>
211
- <td>Winogrande (5-shot)</td>
212
- <td></td>
213
- <td></td>
214
- <td></td>
215
- </tr>
216
- <tr>
217
- <td>TruthfulQA (0-shot, mc2)</td>
218
- <td></td>
219
- <td></td>
220
- <td></td>
221
- </tr>
222
- <tr>
223
- <td><b>Average</b></td>
224
- <td><b></b></td>
225
- <td><b></b></td>
226
- <td><b>%</b></td>
227
- </tr>
228
- <tr>
229
- <td rowspan="7"><b>OpenLLM V2</b></td>
230
- <td>MMLU-Pro (5-shot)</td>
231
- <td></td>
232
- <td></td>
233
- <td></td>
234
- </tr>
235
- <tr>
236
- <td>IFEval (0-shot)</td>
237
- <td></td>
238
- <td></td>
239
- <td></td>
240
- </tr>
241
- <tr>
242
- <td>BBH (3-shot)</td>
243
- <td></td>
244
- <td></td>
245
- <td></td>
246
- </tr>
247
- <tr>
248
- <td>Math-|v|-5 (4-shot)</td>
249
- <td></td>
250
- <td></td>
251
- <td></td>
252
- </tr>
253
- <tr>
254
- <td>GPQA (0-shot)</td>
255
- <td></td>
256
- <td></td>
257
- <td></td>
258
- </tr>
259
- <tr>
260
- <td>MuSR (0-shot)</td>
261
- <td></td>
262
- <td></td>
263
- <td></td>
264
- </tr>
265
- <tr>
266
- <td><b>Average</b></td>
267
- <td><b></b></td>
268
- <td><b></b></td>
269
- <td><b>%</b></td>
270
- </tr>
271
-
272
- <tr>
273
- <td><b>Coding</b></td>
274
- <td>HumanEval pass@1</td>
275
- <td></td>
276
- <td></td>
277
- <td></td>
278
- </tr>
279
- <tr>
280
- <td></td>
281
- <td>HumanEval_64 pass@2</td>
282
- <td></td>
283
- <td></td>
284
- <td></td>
285
- </tr>
286
- </tbody>
287
  </table>
288
 
289
 
@@ -291,6 +306,8 @@ This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, an
291
 
292
  The results were obtained using the following commands:
293
 
294
  #### OpenLLM v1
295
  ```
296
  lm_eval \
@@ -314,22 +331,42 @@ lm_eval \
314
  --batch_size auto
315
  ```
316
 
317
- #### HumanEval and HumanEval_64
318
  ```
319
  lm_eval \
320
  --model vllm \
321
  --model_args pretrained="RedHatAI/Qwen3-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True\
322
  --apply_chat_template \
323
  --fewshot_as_multiturn \
324
- --tasks humaneval_instruct \
325
  --batch_size auto
 
326
 
 
327
 
328
- lm_eval \
329
- --model vllm \
330
- --model_args pretrained="RedHatAI/Qwen3-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True\
331
- --apply_chat_template \
332
- --fewshot_as_multiturn \
333
- --tasks humaneval_64_instruct \
334
- --batch_size auto
335
- ```
 
78
 
79
 This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py), as presented in the code snippet below.
80
 
81
+ <details>
82
+
83
  ```python
84
  from datasets import load_dataset
85
  from transformers import AutoModelForCausalLM, AutoTokenizer
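# Illustrative sketch: the middle of this snippet (calibration data prep and the
# quantization call) is omitted above. The lines below follow the linked
# llm-compressor NVFP4 example; MODEL_ID, NUM_CALIBRATION_SAMPLES and
# MAX_SEQUENCE_LENGTH are assumed values, not taken from this model card.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-32B"          # assumed base checkpoint
NUM_CALIBRATION_SAMPLES = 512        # illustrative
MAX_SEQUENCE_LENGTH = 2048           # illustrative

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: a small slice of UltraChat, rendered with the chat template and tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# NVFP4 recipe: quantize every Linear layer to FP4, leave lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# One-shot calibration and quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)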
 
160
  tokenizer.save_pretrained(SAVE_DIR)
161
 
162
  ```
163
+ </details>
164
 
165
  ## Evaluation
166
 
167
 This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, and HumanEval_64 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness). Reasoning evaluations were run with [lighteval](https://github.com/neuralmagic/lighteval).
168
 
169
  ### Accuracy
170
 
 
179
  </tr>
180
  </thead>
181
  <tbody>
182
+ <!-- OpenLLM V1 (Core) -->
183
  <tr>
184
  <td rowspan="7"><b>OpenLLM V1</b></td>
185
+ <td>arc_challenge</td>
186
+ <td>70.65</td>
187
+ <td>70.22</td>
188
+ <td>99.39</td>
189
+ </tr>
190
+ <tr>
191
+ <td>gsm8k</td>
192
+ <td>74.15</td>
193
+ <td>74.68</td>
194
+ <td>100.71</td>
195
+ </tr>
196
+ <tr>
197
+ <td>hellaswag</td>
198
+ <td>84.00</td>
199
+ <td>83.33</td>
200
+ <td>99.20</td>
201
+ </tr>
202
+ <tr>
203
  <td>mmlu</td>
204
+ <td>81.84</td>
205
+ <td>81.23</td>
206
+ <td>99.25</td>
207
+ </tr>
208
+ <tr>
209
+ <td>truthfulqa_mc2</td>
210
+ <td>59.36</td>
211
+ <td>58.92</td>
212
+ <td>99.26</td>
213
+ </tr>
214
+ <tr>
215
+ <td>winogrande</td>
216
+ <td>75.93</td>
217
+ <td>76.80</td>
218
+ <td>101.15</td>
219
+ </tr>
220
+ <tr>
221
+ <td><b>Average</b></td>
222
+ <td><b>74.32</b></td>
223
+ <td><b>74.20</b></td>
224
+ <td><b>99.83</b></td>
225
+ </tr>
226
+ <tr>
227
+ <td rowspan="7"><b>OpenLLM V2</b></td>
228
+ <td>BBH (3-shot)</td>
229
+ <td>62.35</td>
230
+ <td>60.72</td>
231
+ <td>97.39</td>
232
+ </tr>
233
+ <tr>
234
+ <td>MMLU-Pro (5-shot)</td>
235
+ <td>54.39</td>
236
+ <td>51.13</td>
237
+ <td>94.01</td>
238
+ </tr>
239
+ <tr>
240
+ <td>MuSR (0-shot)</td>
241
+ <td>39.29</td>
242
+ <td>41.01</td>
243
+ <td>104.38</td>
244
+ </tr>
245
+ <tr>
246
+ <td>IFEval (0-shot)</td>
247
+ <td>88.97</td>
248
+ <td>87.29</td>
249
+ <td>98.11</td>
250
+ </tr>
251
+ <tr>
252
+ <td>GPQA (0-shot)</td>
253
+ <td>30.12</td>
254
+ <td>30.29</td>
255
+ <td>100.56</td>
256
+ </tr>
257
+ <tr>
258
+ <td>Math-lvl-5 (4-shot)</td>
259
+ <td>58.99</td>
260
+ <td>56.27</td>
261
+ <td>95.39</td>
262
+ </tr>
263
+ <tr>
264
+ <td><b>Average</b></td>
265
+ <td><b>55.69</b></td>
266
+ <td><b>54.45</b></td>
267
+ <td><b>97.79</b></td>
268
+ </tr>
269
+ <tr>
270
+ <td><b>Coding</b></td>
271
+ <td>HumanEval_64 pass@2</td>
272
+ <td>90.14</td>
273
+ <td>90.40</td>
274
+ <td>100.29</td>
275
+ </tr>
276
+ <tr>
277
+ <td rowspan="4"><b>Reasoning</b></td>
278
+ <td>AIME24 (0-shot)</td>
279
+ <td>75.86</td>
280
+ <td>68.97</td>
281
+ <td>90.93</td>
282
+ </tr>
283
+ <tr>
284
+ <td>AIME25 (0-shot)</td>
285
+ <td>72.41</td>
286
+ <td>65.52</td>
287
+ <td>90.52</td>
288
  </tr>
289
+ <tr>
290
+ <td>GPQA (Diamond, 0-shot)</td>
291
+ <td>62.94</td>
292
+ <td>64.47</td>
293
+ <td>102.43</td>
294
+ </tr>
295
+ <tr>
296
+ <td><b>Average</b></td>
297
+ <td><b>70.40</b></td>
298
+ <td><b>66.32</b></td>
299
+ <td><b>94.21</b></td>
300
+ </tr>
301
+ </tbody>
302
  </table>
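
The third numeric column in the table is read here as recovery: the quantized model's score as a percentage of the unquantized baseline. The column headers themselves are not shown above, so that reading is an assumption; a minimal sketch of the computation:

```python
# Recovery (%) = quantized score / baseline score * 100 (column meaning assumed).
def recovery(baseline: float, quantized: float) -> float:
    return round(100 * quantized / baseline, 2)

# Example with the arc_challenge row above: baseline 70.65, this model 70.22.
print(recovery(70.65, 70.22))  # -> 99.39
```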
303
 
304
 
 
306
 
307
  The results were obtained using the following commands:
308
 
309
+ <details>
310
+
311
  #### OpenLLM v1
312
  ```
313
  lm_eval \
 
331
  --batch_size auto
332
  ```
333
 
334
+ #### HumanEval_64
335
  ```
336
  lm_eval \
337
  --model vllm \
338
  --model_args pretrained="RedHatAI/Qwen3-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True\
339
  --apply_chat_template \
340
  --fewshot_as_multiturn \
341
+ --tasks humaneval_64_instruct \
342
  --batch_size auto
343
+ ```
344
 
345
+ #### LightEval
346
 
347
+ ```
348
+ # --- model_args.yaml ---
349
+ cat > model_args.yaml <<'YAML'
350
+ model_parameters:
351
+   model_name: "RedHatAI/Qwen3-32B-NVFP4"
352
+   dtype: auto
353
+   gpu_memory_utilization: 0.9
354
+   tensor_parallel_size: 2
355
+   max_model_length: 40960
356
+   generation_parameters:
357
+     seed: 42
358
+     temperature: 0.6
359
+     top_k: 20
360
+     top_p: 0.95
361
+     min_p: 0.0
362
+     max_new_tokens: 32768
363
+ YAML
364
+
365
+ lighteval vllm model_args.yaml \
366
+ "lighteval|aime24|0,lighteval|aime25|0,lighteval|gpqa:diamond|0" \
367
+ --max-samples -1 \
368
+ --output-dir out_dir
369
+
370
+ ```
371
+
372
+ </details>