This model has been quantized using GPTQModel.
- bits: 4
- group_size: 128
- desc_act: true
- static_groups: false
- sym: true
- lm_head: false
- damp_percent: 0.01
- true_sequential: true
- model_name_or_path: ""
- model_file_base_name: "model"
- quant_method: "gptq"
- checkpoint_format: "gptq"
- meta:
  - quantizer: "gptqmodel:0.9.9-dev0"
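
As a quick way to read these settings programmatically, here is a minimal sketch that parses them as JSON and prints a one-line summary. The JSON layout, the nesting of `quantizer` under `meta`, and the helper name `summarize` are assumptions made for illustration; they are not taken from the GPTQModel API.

```python
import json

# Values copied from the list above. The JSON layout itself is an assumption.
QUANTIZE_CONFIG_JSON = """
{
  "bits": 4,
  "group_size": 128,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "lm_head": false,
  "damp_percent": 0.01,
  "true_sequential": true,
  "model_name_or_path": "",
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {"quantizer": "gptqmodel:0.9.9-dev0"}
}
"""


def summarize(config: dict) -> str:
    """Return a one-line, human-readable summary (hypothetical helper)."""
    scheme = "symmetric" if config["sym"] else "asymmetric"
    # desc_act enables activation-order ("act-order") column reordering.
    order = "with" if config["desc_act"] else "without"
    return (
        f"{config['bits']}-bit {config['quant_method'].upper()}, "
        f"group size {config['group_size']}, {scheme}, "
        f"{order} activation reordering"
    )


config = json.loads(QUANTIZE_CONFIG_JSON)
summary = summarize(config)
print(summary)
```

In short, the weights are stored at 4 bits with one set of quantization parameters per group of 128 weights, and the `lm_head` layer is left unquantized.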
Currently, only vLLM can load the quantized Gemma-2 27B model and run inference correctly. Here is an example:
```python
import os

# Gemma-2 needs the FlashInfer attention backend because it uses logits_soft_cap;
# otherwise, the output might be wrong. Set this before importing gptqmodel.
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

from transformers import AutoTokenizer
from gptqmodel import BACKEND, GPTQModel

model_name = "ModelCloud/gemma-2-27b-it-gptq-4bit"

prompt = [{"role": "user", "content": "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"}]

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the quantized checkpoint through the vLLM backend.
model = GPTQModel.from_quantized(
    model_name,
    backend=BACKEND.VLLM,
)

# Render the chat messages to a plain prompt string (tokenize=False).
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = model.generate(prompts=inputs, temperature=0.95, max_length=128)
print(outputs[0].outputs[0].text)
```
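
Because `apply_chat_template` is called with `tokenize=False`, `inputs` is a plain string rather than token IDs. The sketch below is a hand-written approximation of what the Gemma-2 chat template is expected to produce for simple user/assistant turns. The function name `render_gemma_prompt` is hypothetical, and the authoritative template is the one shipped with the tokenizer.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Approximate the Gemma-2 chat format (illustrative, not the real template)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma labels assistant turns as "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


rendered = render_gemma_prompt([{"role": "user", "content": "Hi"}])
print(rendered)
```

Printing the real `inputs` value is the reliable way to confirm the exact string that vLLM receives.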