mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq

This is an experimental all-2-bit (group-size=64) HQQ-quantized Llama3-8B-Instruct model.

Llama3-8B is known to be relatively difficult to quantize, especially at lower bit widths, as pointed out by https://arxiv.org/abs/2404.14047.
This 2-bit model has been calibrated with a low-rank adapter (HQQ+) to significantly improve quality, since one-shot 2-bit quantization results in significant quality loss. Moreover, this model is fully compatible with BitBlas and torch.compile for fast inference.
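
To give a rough idea of what "2-bit, group-size=64" means, here is a minimal sketch of plain min-max round-to-nearest group quantization. It is not HQQ's actual algorithm and it ignores the low-rank adapter; the function names and the synthetic weights are made up for illustration. Each group of 64 weights shares one scale and one offset, and every weight is stored as one of four integer codes.

```python
import random

def quantize_group(group, nbits=2):
    """Min-max round-to-nearest quantization of one group of weights."""
    levels = 2 ** nbits - 1                      # 2-bit -> integer codes 0..3
    zero = min(group)                            # offset stored per group
    scale = (max(group) - zero) / levels or 1.0  # step size; 1.0 if the group is constant
    codes = [min(levels, max(0, round((w - zero) / scale))) for w in group]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]

def quantize(weights, group_size=64, nbits=2):
    """Split a flat weight list into groups and quantize each one independently."""
    return [quantize_group(weights[i:i + group_size], nbits)
            for i in range(0, len(weights), group_size)]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights, not from the model
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
mse = sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)
print(f"{len(groups)} groups, reconstruction MSE = {mse:.3e}")
```

With only four levels per group, each weight can be off by up to half a step, which is one way to picture why one-shot 2-bit quantization loses so much quality.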

Model Size

Models	fp16	HQQ+ 2-bit/gs-64
Bitrate (Linear layers)	16	2.63
VRAM (GB)	15.7	4.3
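
The bitrate is higher than 2 because each group also carries quantization metadata. The sketch below is a back-of-the-envelope estimate that assumes one 16-bit scale and one 16-bit zero-point per group of 64 weights; this card does not state how the metadata is actually stored, so `scale_bits` and `zero_bits` are assumptions, and the helper name is hypothetical.

```python
def bits_per_weight(nbits, group_size, scale_bits=16, zero_bits=16):
    """Effective bits per weight: payload plus per-group metadata (assumed sizes)."""
    return nbits + (scale_bits + zero_bits) / group_size

estimate = bits_per_weight(nbits=2, group_size=64)  # 2 + 32/64
reported = 2.63                                     # from the table above
vram_ratio = 15.7 / 4.3                             # fp16 VRAM / 2-bit VRAM, from the table

print(f"estimated bitrate under these assumptions: {estimate}")
print(f"reported bitrate: {reported}")
print(f"VRAM reduction: {vram_ratio:.2f}x")
```

Under these assumptions the estimate comes out below the reported 2.63; the card does not break down where the remainder comes from, so this sketch should not be read as the authors' accounting.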

Model Decoding Speed

Models	fp16	HQQ+ 2-bit/gs-64
Decoding* - short seq (tokens/sec)	53	120
Decoding* - long seq (tokens/sec)	50	95

*: measured on an RTX 3090
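
For reference, the speedups implied by the table can be computed directly. The numbers below are copied from the table; nothing here is a new measurement.

```python
# tokens/sec from the table above: (fp16, HQQ+ 2-bit/gs-64)
decoding = {
    "short seq": (53, 120),
    "long seq": (50, 95),
}

# Speedup of the 2-bit model relative to fp16 for each sequence length
speedups = {name: hqq / fp16 for name, (fp16, hqq) in decoding.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x faster than fp16")
```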

Performance

Models	fp16	HQQ+ 2-bit/gs-64
ARC (25-shot)	62.2	38.82
HellaSwag (10-shot)	78.78	61.09
MMLU (5-shot)	67.06	38.02
TruthfulQA-MC2	51.65	50.08
Winogrande (5-shot)	75.85	63.22
GSM8K (5-shot)	75.97	26.31
Average	68.59	46.26

While this is significantly better than the best 2-bit Llama3-8B model reported in https://arxiv.org/abs/2404.14047 (DB-LLM: 42.1 for HellaSwag and 60.4 for Winogrande), a 4-bit Llama2-7B-chat still looks like the better choice.
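
The averages in the table, and how much of the fp16 score each benchmark retains, can be checked with a few lines of Python. All scores are copied from the table above; the "retained" ratio is just an illustrative way to read them, not a metric used by the authors.

```python
# Benchmark scores from the table above: (fp16, HQQ+ 2-bit/gs-64)
scores = {
    "ARC (25-shot)": (62.2, 38.82),
    "HellaSwag (10-shot)": (78.78, 61.09),
    "MMLU (5-shot)": (67.06, 38.02),
    "TruthfulQA-MC2": (51.65, 50.08),
    "Winogrande (5-shot)": (75.85, 63.22),
    "GSM8K (5-shot)": (75.97, 26.31),
}

fp16_avg = sum(fp16 for fp16, _ in scores.values()) / len(scores)
hqq_avg = sum(hqq for _, hqq in scores.values()) / len(scores)

# Fraction of the fp16 score that the 2-bit model keeps on each benchmark
retained = {name: hqq / fp16 for name, (fp16, hqq) in scores.items()}

print(f"fp16 average: {fp16_avg:.3f}, 2-bit average: {hqq_avg:.3f}")
for name, r in sorted(retained.items(), key=lambda kv: kv[1]):
    print(f"{name}: {r:.1%} of fp16")
```

Read this way, the loss is uneven: GSM8K retains the least of its fp16 score and TruthfulQA-MC2 the most.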

Usage

First, install the dependencies:

pip install git+https://github.com/mobiusml/hqq
pip install bitblas

Then you can use the sample code below:

import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference
from hqq.utils.generation_hf import HFGenerator

# Settings
###################################################
backend       = "bitblas"  # "bitblas" or "gemlite" for the 2-bit runtime
# float16 is used for the 2-bit backends; bfloat16 only applies to "torchao_int4"
compute_dtype = torch.bfloat16 if backend == "torchao_int4" else torch.float16
device        = 'cuda:0'
cache_dir     = '.'

# Load the quantized model together with its low-rank adapter
###################################################
model_id  = 'mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq'
model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype, device=device, adapter='adapter_v0.1.lora').eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

# Use optimized inference kernels
###################################################
prepare_for_inference(model, backend=backend)  # This takes a while

# Generate
###################################################
# For longer contexts, make sure to allocate enough cache via the cache_size= parameter
# Slower generation, but no warm-up:
# gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile=None)
# Faster generation, but warm-up takes a while:
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup()

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
