nm-research committed
Commit 9c6d697 · verified · Parent: 8d77f22

Update README.md

Files changed (1): README.md (+112 −32)

README.md CHANGED
@@ -66,7 +66,9 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
 This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
 
-
+<details>
+<summary>Model Creation Code</summary>
+
 ```bash
 python quantize.py --model_path ibm-granite/granite-3.1-2b-instruct --quant_path "output_dir/granite-3.1-2b-instruct-quantized.w4a16" --calib_size 1024 --dampening_frac 0.01 --observer mse --group_size 64
 ```
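For orientation, the `oneshot(` and `save_pretrained` fragments visible as context in the next hunk suggest that `quantize.py` follows the standard llm-compressor GPTQ flow. The sketch below is only a rough reconstruction under that assumption: the calibration dataset name, sequence length, and the exact wiring of `--group_size 64` and `--observer mse` are placeholders, not the author's actual script.

```python
# Hypothetical sketch of a GPTQ W4A16 flow with llm-compressor;
# the real quantize.py (dataset, observer, group-size handling) may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "ibm-granite/granite-3.1-2b-instruct"
quant_path = "output_dir/granite-3.1-2b-instruct-quantized.w4a16"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# W4A16: 4-bit weights, 16-bit activations; lm_head left unquantized.
# The actual script additionally sets --group_size 64 and --observer mse via its CLI flags.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration set, not confirmed by the README
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=1024,   # matches --calib_size 1024
)

model.save_pretrained(quant_path, save_compressed=True)
tokenizer.save_pretrained(quant_path)
```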
@@ -145,6 +147,7 @@ oneshot(
 model.save_pretrained(quant_path, save_compressed=True)
 tokenizer.save_pretrained(quant_path)
 ```
+</details>
 
 ## Evaluation
 
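The context line in the first hunk notes that vLLM supports OpenAI-compatible serving for this model. As a quick sanity check of the quantized checkpoint outside a server, a minimal offline vLLM sketch is shown below; the prompt and sampling parameters are illustrative and not taken from the README.

```python
# Minimal offline-inference sketch with vLLM (illustrative, not the README's own example).
from vllm import LLM, SamplingParams

# vLLM reads the compressed-tensors quantization config from the checkpoint automatically.
llm = LLM(model="neuralmagic/granite-3.1-2b-instruct-quantized.w4a16")

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
prompt = "Briefly explain weight-only INT4 quantization."  # illustrative prompt
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```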
@@ -203,37 +206,114 @@ evalplus.evaluate \
 
 ### Accuracy
 
-#### OpenLLM Leaderboard V1 evaluation scores
-
-| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 |
-|-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| ARC-Challenge (Acc-Norm, 25-shot) | 55.63 | 54.18 |
-| GSM8K (Strict-Match, 5-shot) | 60.96 | 62.85 |
-| HellaSwag (Acc-Norm, 10-shot) | 75.21 | 73.36 |
-| MMLU (Acc, 5-shot) | 54.38 | 52.17 |
-| TruthfulQA (MC2, 0-shot) | 55.93 | 56.83 |
-| Winogrande (Acc, 5-shot) | 69.67 | 69.85 |
-| **Average Score** | **61.98** | **61.54** |
-| **Recovery** | **100.00** | **99.29** |
-
-#### OpenLLM Leaderboard V2 evaluation scores
-| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 |
-|-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| IFEval (Inst Level Strict Acc, 0-shot)| 67.99 | 67.63 |
-| BBH (Acc-Norm, 3-shot) | 44.11 | 43.22 |
-| Math-Hard (Exact-Match, 4-shot) | 8.66 | 8.77 |
-| GPQA (Acc-Norm, 0-shot) | 28.30 | 28.56 |
-| MUSR (Acc-Norm, 0-shot) | 35.12 | 35.26 |
-| MMLU-Pro (Acc, 5-shot) | 26.87 | 27.27 |
-| **Average Score** | **35.17** | **35.12** |
-| **Recovery** | **100.00** | **99.84** |
-
-
-#### HumanEval pass@1 scores
-| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 |
-|-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| HumanEval Pass@1 | 53.40 | 52.30 |
-
+<table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>ibm-granite/granite-3.1-2b-instruct</th>
+ <th>neuralmagic/granite-3.1-2b-instruct-quantized.w4a16</th>
+ <th>Recovery (%)</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="7"><b>OpenLLM v1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>55.63</td>
+ <td>54.18</td>
+ <td>97.39</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>60.96</td>
+ <td>62.85</td>
+ <td>103.10</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>75.21</td>
+ <td>73.36</td>
+ <td>97.54</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>54.38</td>
+ <td>52.17</td>
+ <td>95.93</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>55.93</td>
+ <td>56.83</td>
+ <td>101.61</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>69.67</td>
+ <td>69.85</td>
+ <td>100.26</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>61.98</b></td>
+ <td><b>61.54</b></td>
+ <td><b>99.29</b></td>
+ </tr>
+ <tr>
+ <td rowspan="7"><b>OpenLLM v2</b></td>
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+ <td>67.99</td>
+ <td>67.63</td>
+ <td>99.47</td>
+ </tr>
+ <tr>
+ <td>BBH (Acc-Norm, 3-shot)</td>
+ <td>44.11</td>
+ <td>43.22</td>
+ <td>97.98</td>
+ </tr>
+ <tr>
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
+ <td>8.66</td>
+ <td>8.77</td>
+ <td>101.27</td>
+ </tr>
+ <tr>
+ <td>GPQA (Acc-Norm, 0-shot)</td>
+ <td>28.30</td>
+ <td>28.56</td>
+ <td>100.92</td>
+ </tr>
+ <tr>
+ <td>MUSR (Acc-Norm, 0-shot)</td>
+ <td>35.12</td>
+ <td>35.26</td>
+ <td>100.40</td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro (Acc, 5-shot)</td>
+ <td>26.87</td>
+ <td>27.27</td>
+ <td>101.49</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>35.17</b></td>
+ <td><b>35.12</b></td>
+ <td><b>99.84</b></td>
+ </tr>
+ <tr>
+ <td rowspan="2"><b>HumanEval</b></td>
+ <td>HumanEval Pass@1</td>
+ <td>53.40</td>
+ <td>52.30</td>
+ <td>97.94</td>
+ </tr>
+ </tbody>
+</table>
+
+
 
 ## Inference Performance
 
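Two closing notes on the accuracy table added in this commit. First, the Recovery column is simply the quantized score expressed as a percentage of the baseline; for example, ARC-Challenge: 54.18 / 55.63 × 100 ≈ 97.39. Second, OpenLLM-style scores such as these are commonly reproduced with lm-evaluation-harness on a vLLM backend; the command below is a hedged example of that pattern, with the task group and model arguments assumed rather than copied from this README's Evaluation section (which documents the exact invocations used, including the `evalplus.evaluate` call for HumanEval).

```bash
# Hedged example only; see the README's Evaluation section for the exact commands used.
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w4a16",dtype=auto \
  --tasks openllm \
  --batch_size auto
```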