nm-research committed on
Commit 8d77f22 · verified · 1 Parent(s): c7228d1

Update README.md

Files changed (1): README.md (+25 −9)
README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 
 max_model_len, tp_size = 4096, 1
-model_name = "neuralmagic-ent/granite-3.1-2b-instruct-quantized.w4a16"
+model_name = "neuralmagic/granite-3.1-2b-instruct-quantized.w4a16"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
 sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
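
Note (editorial): the deployment snippet in the hunk above is cut off at the diff boundary. Below is a minimal sketch of how the loaded model would typically be used to generate a chat completion with vLLM; the example `messages` content and the chat-template call are assumptions for illustration, not lines from the README.

```python
# Editor's sketch (not part of the commit): continues from the variables defined
# in the snippet above (tokenizer, llm, sampling_params). The example prompt is
# an illustrative assumption.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# Render the chat template to a plain-text prompt, then generate with vLLM.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
outputs = llm.generate([prompt], sampling_params)

print(outputs[0].outputs[0].text)
```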
@@ -148,13 +148,16 @@ tokenizer.save_pretrained(quant_path)
 
 ## Evaluation
 
-The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+
+<details>
+<summary>Evaluation Commands</summary>
 
 OpenLLM Leaderboard V1:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic-ent/granite-3.1-2b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
   --tasks openllm \
   --write_out \
   --batch_size auto \
@@ -162,11 +165,23 @@ lm_eval \
   --show_config
 ```
 
+OpenLLM Leaderboard V2:
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+  --tasks leaderboard \
+  --write_out \
+  --batch_size auto \
+  --output_path output_dir \
+  --show_config
+```
+
 #### HumanEval
 ##### Generation
 ```
 python3 codegen/generate.py \
-  --model neuralmagic-ent/granite-3.1-2b-instruct-quantized.w4a16 \
+  --model neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 \
   --bs 16 \
   --temperature 0.2 \
   --n_samples 50 \
@@ -176,20 +191,21 @@ python3 codegen/generate.py \
 ##### Sanitization
 ```
 python3 evalplus/sanitize.py \
-  humaneval/neuralmagic-ent--granite-3.1-2b-instruct-quantized.w4a16_vllm_temp_0.2
+  humaneval/neuralmagic--granite-3.1-2b-instruct-quantized.w4a16_vllm_temp_0.2
 ```
 ##### Evaluation
 ```
 evalplus.evaluate \
   --dataset humaneval \
-  --samples humaneval/neuralmagic-ent--granite-3.1-2b-instruct-quantized.w4a16_vllm_temp_0.2-sanitized
+  --samples humaneval/neuralmagic--granite-3.1-2b-instruct-quantized.w4a16_vllm_temp_0.2-sanitized
 ```
+</details>
 
 ### Accuracy
 
 #### OpenLLM Leaderboard V1 evaluation scores
 
-| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-quantized.w4a16 |
+| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 |
 |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
 | ARC-Challenge (Acc-Norm, 25-shot) | 55.63 | 54.18 |
 | GSM8K (Strict-Match, 5-shot) | 60.96 | 62.85 |
@@ -201,7 +217,7 @@ evalplus.evaluate \
 | **Recovery** | **100.00** | **99.29** |
 
 #### OpenLLM Leaderboard V2 evaluation scores
-| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-quantized.w4a16 |
+| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 |
 |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
 | IFEval (Inst Level Strict Acc, 0-shot)| 67.99 | 67.63 |
 | BBH (Acc-Norm, 3-shot) | 44.11 | 43.22 |
@@ -214,7 +230,7 @@ evalplus.evaluate \
 
 
 #### HumanEval pass@1 scores
-| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-quantized.w4a16 |
+| Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic/granite-3.1-2b-instruct-quantized.w4a16 |
 |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
 | HumanEval Pass@1 | 53.40 | 52.30 |
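
Note (editorial): the **Recovery** rows in the score tables report the quantized model's accuracy relative to the unquantized `ibm-granite/granite-3.1-2b-instruct` baseline. The exact averaging convention is not visible in this diff, so the sketch below only illustrates the per-metric ratio, using the GSM8K numbers shown above; treat it as an assumption.

```python
# Editor's sketch (assumed convention, not shown in the diff):
# recovery expresses a quantized score as a percentage of the baseline score.
def recovery(quantized: float, baseline: float) -> float:
    return 100.0 * quantized / baseline

# GSM8K (Strict-Match, 5-shot) from the OpenLLM Leaderboard V1 table above:
print(f"{recovery(62.85, 60.96):.2f}%")  # ~103.10%, quantized slightly above baseline
```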