---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---
This is an experimental <a href="https://github.com/mobiusml/hqq/">HQQ</a> 2-bit quantized <a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">Llama2-7B-chat model</a> that uses a low-rank adapter to improve performance (referred to as <a href="https://mobiusml.github.io/1bit_blog/">HQQ+</a>).

Quantizing small models at such extreme low bit-widths is a challenging task. The purpose of this model is to show the community what to expect when fine-tuning such models.
We noticed that, when given more specialized data, the low-bit model can even outperform the full-precision model on some tasks.

This version offloads the quantization metadata to the CPU, so only the 2-bit weights and the low-rank adapters are kept in GPU memory.
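For reference, metadata offloading in HQQ is controlled through the quantization config. The snippet below is a minimal sketch, assuming the deprecated `hqq==0.1.8` API pinned in the Usage section, of how a comparable 2-bit, group-size-8 setup with CPU-offloaded metadata could be produced; the exact flags used for this checkpoint are not published here, so treat them as illustrative.
``` Python
# Minimal sketch (not the exact recipe used for this checkpoint):
# quantize the base model to 2 bits with group size 8 and keep the metadata on the CPU.
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model = HQQModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

quant_config = BaseQuantizeConfig(nbits=2, group_size=8, offload_meta=True)  # flags are assumptions
model.quantize_model(quant_config=quant_config)
```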
## Datasets
The adapter was trained via SFT on random subsets of the following:
### Base Model
* <a href="https://huggingface.co/datasets/wikitext">wikitext-2-raw-v1</a> (full)
### Chat Model
* <a href="https://huggingface.co/datasets/timdettmers/openassistant-guanaco">timdettmers/openassistant-guanaco</a> (full)
* <a href="https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k">microsoft/orca-math-word-problems-200k</a> (10K)
* <a href="https://huggingface.co/datasets/meta-math/MetaMathQA"> meta-math/MetaMathQA </a> (10K)
* <a href="https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized"> HuggingFaceH4/ultrafeedback_binarized </a> (10K - chosen answers only)
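The exact random subsets are not published, but drawing comparable 10K-example subsets is straightforward with the `datasets` library; the seed below is an assumption for illustration.
``` Python
# Minimal sketch: draw a 10K random subset for SFT (the seed is an illustrative assumption).
from datasets import load_dataset

metamath = load_dataset("meta-math/MetaMathQA", split="train")
metamath_10k = metamath.shuffle(seed=42).select(range(10_000))
```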
## Performance
| Models             | Llama2-7B (fp16) | Llama2-7B (HQQ 2-bit) | Llama2-7B (HQQ+ 2-bit) | Quip# (2-bit) |
|---------------------|------------------|-----------------------|------------------------|---------------|
| Wiki Perplexity     | 5.18             | 6.06                  | <b>5.14</b>            | 8.54          |
| VRAM (GB)           | 13.5             | <b>2.6</b>            | 2.69                   | 2.72          |
| Forward time (sec)  | <b>0.1</b>       | 0.221                 | 0.27                   | 0.353         |

| Models | Llama2-7B-chat (fp16)| Llama2-7B-chat (HQQ 2-bit)| Llama2-7B-chat (HQQ+ 2-bit)|
|-------------------|------------------|------------------|------------------|
| ARC (25-shot) | 53.67 | 45.56 | 47.01 |
| HellaSwag (10-shot)| 78.56 | 73.59 | 73.74 |
| MMLU (5-shot) | 48.16 | 43.18 | 43.33 |
| TruthfulQA-MC2 | 45.32 | 43.1 | 42.66 |
| Winogrande (5-shot)| 72.53 | 67.32 | 71.51 |
| GSM8K (5-shot) | 23.12 | 9.7 | 28.43 |
| Average | 53.56 | 47.08 | 51.11 |
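Wiki perplexity above is measured on wikitext-2. The snippet below is a minimal sketch of a standard chunked perplexity evaluation, assuming the quantized model exposes the usual `transformers` causal-LM forward with `labels`; the chunk length is an illustrative assumption.
``` Python
# Minimal sketch: chunked wikitext-2 perplexity (chunk length is an illustrative assumption).
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext_perplexity(model, tokenizer, chunk_len=1024, device='cuda'):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids  = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids.to(device)

    nlls, n_tokens = [], 0
    for start in range(0, ids.shape[1] - chunk_len, chunk_len):
        chunk = ids[:, start:start + chunk_len]
        loss  = model(chunk, labels=chunk).loss      # mean token NLL over the chunk
        nlls.append(loss * chunk.shape[1])
        n_tokens += chunk.shape[1]
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```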
## Usage
To run the model, install the HQQ library:
```
# This model is deprecated and requires older versions of hqq and transformers:
pip install hqq==0.1.8
pip install transformers==4.46.0
```
and use it as follows:
``` Python
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
#Setup Inference Mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token: tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();
# Optional: torch compile for faster inference
# model = torch.compile(model)
#Streaming Inference
import torch, transformers
from threading import Thread
def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to(device),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.pad_token_id,
        top_p=0.90 if do_sample else None,
        top_k=50 if do_sample else None,
        temperature=0.6 if do_sample else None,
        num_beams=1,
        repetition_penalty=1.2,
    )

    # Run generation in a background thread and stream tokens as they are produced
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    print("User: ", chat)
    print("Assistant: ")
    outputs = ""
    for text in streamer:
        outputs += text
        print(text, end="", flush=True)

    torch.cuda.empty_cache()
    return outputs
```
### Example
``` Python
outputs = chat_processor("If you had 5 apples yesterday and you ate 2 today morning, how many apples do you have this evening?", max_new_tokens=1000, do_sample=False)
```
```
User: If you had 5 apples yesterday and you ate 2 today morning, how many apples do you have this evening?
Assistant:
You started with 5 apples.You ate 2 of them so now you have 5-2=3 apples left.So by the evening you will still have 3 apples.
```
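Since `chat_processor` defaults to `do_sample=True`, the same call can also be made with sampling enabled (top-p/top-k with temperature 0.6, as set above), in which case outputs will vary between runs. The prompt below is just an illustrative example:
``` Python
# Sampled generation: top_p, top_k, and temperature are active because do_sample=True
outputs = chat_processor("Suggest three dinner ideas using only rice, eggs, and spinach.", max_new_tokens=300, do_sample=True)
```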