This is an HQQ-quantized version (4-bit, group-size=64) of the gemma-3-12b-it model.
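"4-bit, group-size=64" means each weight is stored in 4 bits and every group of 64 weights shares its own scale/zero-point. As a point of reference only (this is a minimal sketch, not the exact recipe used to produce this checkpoint), an equivalent on-the-fly HQQ setting can be expressed with the `HqqConfig` class in `transformers`:

```python
import torch
from transformers import Gemma3ForConditionalGeneration, HqqConfig

# 4-bit weights, 64 weights per quantization group (matches this repo's settings).
quant_config = HqqConfig(nbits=4, group_size=64)

# Assumption: quantizing the official base checkpoint while loading;
# this repo instead ships the weights already quantized.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
)
```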
## Usage

```python
import torch

backend       = "gemlite"
compute_dtype = torch.bfloat16
cache_dir     = None
model_id      = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load model
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

# Optimize
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)

############################################################################
# Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```
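Note that the snippet above needs the `hqq` package (for the quantized layers and `prepare_for_inference`), and the `gemlite` backend additionally requires the `gemlite` kernel package and a CUDA GPU.

The same loaded model and processor also handle text-only prompts. A minimal sketch (the question string is just an illustrative placeholder):

```python
# Text-only chat with the already-loaded model/processor (no image entry needed).
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain what 4-bit quantization does to model memory usage."}]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=compute_dtype)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```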