Chesebrough committed on
Commit
ac16bb1
·
verified ·
1 Parent(s): e852bc2

Update ReadMe.md

added a section near bottom for testing quantizability of a model such as intel_neural_chat

Files changed (1)
  1. README.md +61 -0
README.md CHANGED
@@ -247,6 +247,67 @@ The model was submitted to the [LLM Leaderboard](https://huggingface.co/spaces/H
  | [Intel/neural-chat-7b-v3](https://huggingface.co/Intel/neural-chat-7b-v3) | **57.31** | 67.15 | 83.29 | 62.26 | 58.77 | 78.06 | 1.21 | 50.43 |
  | [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) | **59.06** | 66.21 | 83.64 | 62.37 | 59.65 | 78.14 | 19.56 | 43.84 |

+ ## Testing Model Quantizability
+ The following code block can be run to determine whether a PyTorch model is amenable to quantization.
+ One caveat: the Intel Extension for PyTorch method uses optimum ipex, which is pre-release and needs further testing.
+
+ To install the dependencies, first install Intel Extension for PyTorch, then pip install each of the following:
+ - torch
+ - optimum.intel
+ - optimum[ipex]
+ - transformers
+
+ ### Intel Extension for PyTorch method:
+ This example tests whether neural-chat-7b-v3-1 can be quantized and reports the resulting change in model size.
+ For example, when the base type is set to torch.bfloat16 and load_in_4bit=True is also specified, so that only the weights are quantized, the test produces output such as:
+ - **model_quantize_internal: model size = 27625.02 MB**
+ - **model_quantize_internal: quant size = 4330.80 MB**
+
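As a rough reading of the two sizes in the sample output, the sketch below computes the reduction they imply. Only the two MB values come from that output; the helper name `size_reduction` is hypothetical and is not part of any Intel or Hugging Face library.

```python
def size_reduction(model_mb: float, quant_mb: float) -> dict:
    """Summarize how much smaller the quantized weights are than the original."""
    if model_mb <= 0 or quant_mb <= 0:
        raise ValueError("sizes must be positive")
    return {
        "ratio": model_mb / quant_mb,                      # how many times smaller
        "saved_mb": model_mb - quant_mb,                   # absolute saving in MB
        "saved_pct": 100.0 * (1.0 - quant_mb / model_mb),  # relative saving
    }


# Values copied from the sample log lines (assumed to be in the same unit).
summary = size_reduction(27625.02, 4330.80)
print(f"{summary['ratio']:.2f}x smaller, {summary['saved_pct']:.1f}% saved")
```

For the sample values this works out to a reduction of roughly 6.4x, or about 84% of the original size saved.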
+ Run this code from a Python script, such as ipex_test.py:
+ ```python
+ import os
+ import torch
+ from transformers import AutoTokenizer
+ from intel_extension_for_transformers.transformers import AutoModelForCausalLM
+
+ model_name = "Intel/neural-chat-7b-v3-1"
+ prompt = "Once upon a time, there existed a little girl,"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ inputs = tokenizer(prompt, return_tensors="pt").input_ids
+
+ typ = torch.bfloat16
+ result = {typ: "failed"}
+ outputs = None  # stays None if loading or generation fails
+ try:
+     # load_in_4bit=True quantizes the weights only
+     model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, torch_dtype=typ)
+     outputs = model.generate(inputs, max_new_tokens=20)
+     result[typ] = f"passed, {os.stat(model.bin_file).st_size}"
+ except Exception as err:
+     result[typ] = f"failed: {err}"
+
+ print("\n\nResults of quantizing: ")
+ # Scan the log written by tee for the quantizer's size lines
+ with open("output.log", "r") as fp:
+     for line in fp:
+         if "model_quantize_internal" in line:
+             print(line.rstrip())
+
+ print("\n\nExecution results ")
+ for k, v in result.items():
+     print(k, v)
+
+ print("\n\nModel Output: ")
+ if outputs is not None:
+     print(tokenizer.decode(outputs[0], skip_special_tokens=True).strip())
+ ```
+ Run the script from a bash terminal as follows:
+ ```bash
+ python ipex_test.py 2>&1 | tee output.log
+ ```
+ The entire output is captured in output.log. The script then prints a summary:
+ the quantization size lines from the log, whether quantization passed or failed, and the model output for the given prompt.
+
+
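The log scan described above could also be factored into a small helper that returns the sizes as numbers. This is a minimal sketch, assuming the quantizer's lines keep the `model_quantize_internal: <kind> size = <number> MB` shape seen in the sample output; the real log format may differ, and `parse_quant_sizes` is a hypothetical name.

```python
import re

# Pattern inferred from the two sample lines only; treat it as an assumption.
SIZE_RE = re.compile(
    r"model_quantize_internal:\s*(model|quant) size\s*=\s*([\d.]+)\s*MB"
)


def parse_quant_sizes(lines):
    """Return a dict such as {'model': MB, 'quant': MB} for the size lines found."""
    sizes = {}
    for line in lines:
        match = SIZE_RE.search(line)
        if match:
            sizes[match.group(1)] = float(match.group(2))
    return sizes


# Illustrative input: the first entry is a placeholder, not real log output.
sample = [
    "some unrelated line",
    "model_quantize_internal: model size = 27625.02 MB",
    "model_quantize_internal: quant size = 4330.80 MB",
]
sizes = parse_quant_sizes(sample)
print(sizes)
```

With a real run, the open file handle for output.log can be passed in place of `sample`, since the helper only iterates over lines. An empty result would suggest that no quantization size lines were logged.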
  ## Ethical Considerations and Limitations
  Neural-chat-7b-v3-1 can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.