|
--- |
|
license: gemma |
|
library_name: transformers |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- int4 |
|
- vllm |
|
- llmcompressor |
|
base_model: google/gemma-3-12b-it |
|
--- |
|
|
|
# gemma-3-12b-it-GPTQ-4b-128g |
|
|
|
## Model Overview |
|
|
|
This model was obtained by quantizing the weights of [gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it) to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting the disk size and GPU memory requirements by approximately 75%.
|
|
|
Only the weights of the linear operators within the `language_model` transformer blocks are quantized; the vision model and multimodal projector are kept in their original precision. Weights are quantized with a symmetric per-group scheme (group size 128) using the GPTQ algorithm.
|
|
|
The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
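For reference, a quantization run of this kind can be sketched with [llm-compressor](https://github.com/vllm-project/llm-compressor). The example below is a hypothetical sketch, not the exact recipe used for this checkpoint: the `W4A16` preset corresponds to symmetric 4-bit weights with group size 128, while the calibration dataset, sample count, and the `ignore` patterns for the vision tower and projector are illustrative assumptions.

```python
# Hypothetical one-shot GPTQ sketch with llm-compressor; dataset, sample count,
# and ignore patterns are illustrative assumptions, not the settings used here.
from transformers import Gemma3ForConditionalGeneration
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it", torch_dtype="auto"
)

recipe = GPTQModifier(
    targets="Linear",   # quantize linear layers only
    scheme="W4A16",     # symmetric INT4 weights, group size 128
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",          # keep the vision model in original precision
        "re:.*multi_modal_projector.*", # keep the multimodal projector in original precision
    ],
)

oneshot(
    model=model,
    dataset="open_platypus",   # assumed text-only calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="gemma-3-12b-it-GPTQ-4b-128g",  # saved in compressed-tensors format
)
```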
|
|
|
## Evaluation |
|
|
|
This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the `vLLM` engine. |
|
|
|
| Model                      | ARC-C  | GSM8k  | HellaSwag | MMLU   | TruthfulQA-mc2 | Winogrande | Average | Recovery |
|
|----------------------------|:------:|:------:|:---------:|:------:|:--------------:|:----------:|:-------:|:--------:| |
|
| gemma-3-12b-it | 0.7125 | 0.8719 | 0.8377 | 0.7230 | 0.5798 | 0.7893 | 0.7524 | 1.0000 | |
|
| gemma-3-12b-it-INT4 (this) | 0.6988 | 0.8643 | 0.8254 | 0.7078 | 0.5638 | 0.7830 | 0.7405 | 0.9842 | |
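Recovery is the ratio of the quantized model's average score to the unquantized baseline's average: 0.7405 / 0.7524 ≈ 0.9842, i.e. the INT4 model retains roughly 98.4% of the baseline accuracy.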
|
|
|
## Reproduction |
|
|
|
The results were obtained using the following commands: |
|
|
|
```bash |
|
MODEL=ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g |
|
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80" |
|
|
|
lm_eval \ |
|
--model vllm \ |
|
--model_args $MODEL_ARGS \ |
|
--tasks openllm \ |
|
--batch_size auto |
|
``` |
|
|
|
|
|
## Usage |
|
|
|
* To use the model in `transformers`, update the package to a release with Gemma 3 support:
|
|
|
`pip install git+https://github.com/huggingface/[email protected]` |
|
* To use the model in `vLLM`, update the package to a version that includes this [PR](https://github.com/vllm-project/vllm/pull/14660/files). A minimal offline-inference sketch is shown below.
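The following is a minimal, text-only sketch of offline inference with vLLM's `LLM` API; the prompt and sampling settings are illustrative assumptions.

```python
# Hypothetical text-only example with vLLM's offline LLM API;
# the prompt and sampling settings are illustrative, not prescribed by this card.
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g", max_model_len=4096)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what INT4 weight-only quantization changes about a model."},
]

# llm.chat applies the model's chat template before generating
outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```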
|
|
|
An example of inference via `transformers` is provided below:
|
|
|
```python |
|
# pip install accelerate |
|
|
|
from transformers import AutoProcessor, Gemma3ForConditionalGeneration |
|
from PIL import Image |
|
import requests |
|
import torch |
|
|
|
model_id = "ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g" |
|
|
|
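# Load the quantized checkpoint; the INT4 weights are stored in compressed-tensors format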
model = Gemma3ForConditionalGeneration.from_pretrained( |
|
model_id, device_map="auto" |
|
).eval() |
|
|
|
processor = AutoProcessor.from_pretrained(model_id) |
|
|
|
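# Multimodal chat: system prompt plus a user turn with an image URL and a text question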
messages = [ |
|
{ |
|
"role": "system", |
|
"content": [{"type": "text", "text": "You are a helpful assistant."}] |
|
}, |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}, |
|
{"type": "text", "text": "Describe this image in detail."} |
|
] |
|
} |
|
] |
|
|
|
inputs = processor.apply_chat_template( |
|
messages, add_generation_prompt=True, tokenize=True, |
|
return_dict=True, return_tensors="pt" |
|
).to(model.device, dtype=torch.bfloat16) |
|
|
|
input_len = inputs["input_ids"].shape[-1] |
|
|
|
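# Greedy decoding; keep only the tokens generated after the prompt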
with torch.inference_mode(): |
|
generation = model.generate(**inputs, max_new_tokens=100, do_sample=False) |
|
generation = generation[0][input_len:] |
|
|
|
decoded = processor.decode(generation, skip_special_tokens=True) |
|
print(decoded) |
|
|
|
# **Overall Impression:** The image is a close-up shot of a vibrant garden scene, |
|
# focusing on a cluster of pink cosmos flowers and a busy bumblebee. |
|
# It has a slightly soft, natural feel, likely captured in daylight. |
|
``` |