---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---
This is an experimental <a href="https://github.com/mobiusml/hqq/">HQQ</a> 2-bit quantized <a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">Llama2-7B-chat model</a> that uses a low-rank adapter to improve performance (referred to as <a href="https://mobiusml.github.io/1bit_blog/">HQQ+</a>).

Quantizing small models at such extreme low bit-widths is a challenging task. The purpose of this model is to show the community what to expect when fine-tuning such models.
We noticed that, when given more specialized data, the low-bit model can even outperform the full-precision model on some tasks.

This version offloads the quantization metadata to the CPU, so only the 2-bit weights and the low-rank adapters are kept in GPU memory.
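For reference, metadata offloading in HQQ is controlled through the quantization config. The snippet below is a minimal sketch, assuming the deprecated `hqq==0.1.8` API pinned in the Usage section, of how a comparable 2-bit, group-size-8 setup with CPU-offloaded metadata could be produced; the exact flags used for this checkpoint are not published here, so treat them as illustrative.
``` Python
# Minimal sketch (not the exact recipe used for this checkpoint):
# quantize the base model to 2 bits with group size 8 and keep the metadata on the CPU.
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model = HQQModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

quant_config = BaseQuantizeConfig(nbits=2, group_size=8, offload_meta=True)  # flags are assumptions
model.quantize_model(quant_config=quant_config)
```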
## Datasets
The adapter was trained via SFT on random subsets of the following:
### Base Model
* <a href="https://huggingface.co/datasets/wikitext">wikitext-2-raw-v1</a> (full)
### Chat Model
* <a href="https://huggingface.co/datasets/timdettmers/openassistant-guanaco">timdettmers/openassistant-guanaco</a> (full)
* <a href="https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k">microsoft/orca-math-word-problems-200k</a> (10K)
* <a href="https://huggingface.co/datasets/meta-math/MetaMathQA"> meta-math/MetaMathQA </a> (10K)
* <a href="https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized"> HuggingFaceH4/ultrafeedback_binarized </a> (10K - chosen answers only)
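The exact random subsets are not published, but drawing comparable 10K-example subsets is straightforward with the `datasets` library; the seed below is an assumption for illustration.
``` Python
# Minimal sketch: draw a 10K random subset for SFT (the seed is an illustrative assumption).
from datasets import load_dataset

metamath = load_dataset("meta-math/MetaMathQA", split="train")
metamath_10k = metamath.shuffle(seed=42).select(range(10_000))
```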
## Performance
| Models             | Llama2-7B (fp16) | Llama2-7B (HQQ 2-bit) | Llama2-7B (HQQ+ 2-bit) | Quip# (2-bit) |
|---------------------|------------------|-----------------------|------------------------|---------------|
| Wiki Perplexity     | 5.18             | 6.06                  | <b>5.14</b>            | 8.54          |
| VRAM (GB)           | 13.5             | <b>2.6</b>            | 2.69                   | 2.72          |
| Forward time (sec)  | <b>0.1</b>       | 0.221                 | 0.27                   | 0.353         |

| Models | Llama2-7B-chat (fp16)| Llama2-7B-chat (HQQ 2-bit)| Llama2-7B-chat (HQQ+ 2-bit)|
|-------------------|------------------|------------------|------------------|
| ARC (25-shot) | 53.67 | 45.56 | 47.01 |
| HellaSwag (10-shot)| 78.56 | 73.59 | 73.74 |
| MMLU (5-shot) | 48.16 | 43.18 | 43.33 |
| TruthfulQA-MC2 | 45.32 | 43.1 | 42.66 |
| Winogrande (5-shot)| 72.53 | 67.32 | 71.51 |
| GSM8K (5-shot) | 23.12 | 9.7 | 28.43 |
| Average | 53.56 | 47.08 | 51.11 |
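Wiki perplexity above is measured on wikitext-2. The snippet below is a minimal sketch of a standard chunked perplexity evaluation, assuming the quantized model exposes the usual `transformers` causal-LM forward with `labels`; the chunk length is an illustrative assumption.
``` Python
# Minimal sketch: chunked wikitext-2 perplexity (chunk length is an illustrative assumption).
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext_perplexity(model, tokenizer, chunk_len=1024, device='cuda'):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids  = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids.to(device)

    nlls, n_tokens = [], 0
    for start in range(0, ids.shape[1] - chunk_len, chunk_len):
        chunk = ids[:, start:start + chunk_len]
        loss  = model(chunk, labels=chunk).loss      # mean token NLL over the chunk
        nlls.append(loss * chunk.shape[1])
        n_tokens += chunk.shape[1]
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```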
## Usage
To run the model, install the HQQ library:
```
# This model is deprecated and requires older versions of hqq and transformers:
pip install hqq==0.1.8
pip install transformers==4.46.0
```
and use it as follows:
``` Python
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
#Setup Inference Mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token: tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();
# Optional: torch compile for faster inference
# model = torch.compile(model)
#Streaming Inference
import torch, transformers
from threading import Thread
def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to(device),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.pad_token_id,
        top_p=0.90 if do_sample else None,
        top_k=50 if do_sample else None,
        temperature=0.6 if do_sample else None,
        num_beams=1,
        repetition_penalty=1.2,
    )

    # Run generation in a background thread and stream tokens as they are produced
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    print("User: ", chat)
    print("Assistant: ")
    outputs = ""
    for text in streamer:
        outputs += text
        print(text, end="", flush=True)

    torch.cuda.empty_cache()
    return outputs
```
### Example
``` Python
outputs = chat_processor("If you had 5 apples yesterday and you ate 2 today morning, how many apples do you have this evening?", max_new_tokens=1000, do_sample=False)
```
```
User: If you had 5 apples yesterday and you ate 2 today morning, how many apples do you have this evening?
Assistant:
You started with 5 apples.You ate 2 of them so now you have 5-2=3 apples left.So by the evening you will still have 3 apples.
```
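Since `chat_processor` defaults to `do_sample=True`, the same call can also be made with sampling enabled (top-p/top-k with temperature 0.6, as set above), in which case outputs will vary between runs. The prompt below is just an illustrative example:
``` Python
# Sampled generation: top_p, top_k, and temperature are active because do_sample=True
outputs = chat_processor("Suggest three dinner ideas using only rice, eggs, and spinach.", max_new_tokens=300, do_sample=True)
```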