🚀 Quantized Llama-3.1-8B-Instruct Model

This is a 4-bit quantized version of the meta-llama/Llama-3.1-8B-Instruct model, optimized for efficient inference in resource-constrained environments such as Google Colab's NVIDIA T4 GPU.

馃 Model Description

The model was quantized using the bitsandbytes library to reduce memory usage while maintaining performance for instruction-following tasks.
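
For a rough sense of why 4-bit storage fits on a 16 GB T4, the back-of-envelope estimate below compares the weight memory of an ~8B-parameter model in float16 versus NF4 (approximate figures only; activations, the KV cache, and CUDA overhead come on top). Once the model is loaded as shown under Usage, model.get_memory_footprint() reports the measured value.

# Back-of-envelope weight-memory estimate (approximate).
num_params = 8.0e9                         # ~8B parameters in Llama-3.1-8B

fp16_gb = num_params * 2 / 1024**3         # 2 bytes per weight in float16
nf4_gb = num_params * 0.5 / 1024**3 * 1.1  # ~0.5 bytes per weight in NF4, ~10% overhead

print(f"float16 weights: ~{fp16_gb:.1f} GB")  # ~14.9 GB
print(f"NF4 weights:     ~{nf4_gb:.1f} GB")   # ~4.1 GB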

🧮 Quantization Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Quantization Method: 4-bit (NormalFloat4, NF4) with double quantization
  • Compute Dtype: float16
  • Library: bitsandbytes==0.43.3
  • Framework: transformers==4.45.1
  • Hardware: NVIDIA T4 GPU (16GB VRAM) in Google Colab
  • Date: Quantized on June 20, 2025

📦 Files Included

  • README.md: This file
  • config.json, model.safetensors (or sharded safetensors checkpoints): Model configuration and weights
  • special_tokens_map.json, tokenizer.json, tokenizer_config.json: Tokenizer files
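
If you only need these files on local disk (for offline loading or inspection), they can be fetched without instantiating the model via huggingface_hub's snapshot_download; the repo ID below is the same placeholder used in the Usage section:

from huggingface_hub import snapshot_download

# Download all repo files into a local directory (replace the placeholder repo ID).
local_path = snapshot_download(
    repo_id="your-username/quantized_Llama-3.1-8B-Instruct",
    local_dir="./quantized_Llama-3.1-8B-Instruct",
)
print(local_path)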

Usage

To load and use the quantized model for inference:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Define quantization configuration
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/quantized_Llama-3.1-8B-Instruct",  # Replace with your Hugging Face repo ID
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/quantized_Llama-3.1-8B-Instruct")

# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Perform inference
prompt = "Hello, how can I assist you today?"
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output)
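
Since the base model is instruction-tuned, prompts generally behave better when wrapped in the Llama 3.1 chat template. A minimal sketch using the tokenizer's built-in template with the same pipeline (the example messages are arbitrary):

# Format a conversation with the chat template before generating.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what 4-bit quantization does in two sentences."},
]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
chat_output = generator(chat_prompt, max_new_tokens=128, return_full_text=False)
print(chat_output[0]["generated_text"])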

Quantization Process

The model was quantized in Google Colab using the following script:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from huggingface_hub import login

# Log in to Hugging Face
login()  # Requires a Hugging Face token

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load and quantize the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Llama tokenizers ship without a pad token; fall back to the EOS token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Save the quantized model
quant_path = "/content/quantized_Llama-3.1-8B-Instruct"
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
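
To keep the result beyond the Colab session, the saved model can also be pushed directly to the Hugging Face Hub (a sketch; it reuses the login above, and the repo ID is a placeholder you would replace with your own):

# Upload the quantized model and tokenizer to the Hub (placeholder repo ID).
repo_id = "your-username/quantized_Llama-3.1-8B-Instruct"
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)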

Requirements

  • Hardware: NVIDIA GPU with CUDA 11.4+ (e.g., T4, A100)
  • Python: 3.10+
  • Dependencies (see the install cell after this list):
    • transformers==4.45.1
    • bitsandbytes==0.43.3
    • accelerate==0.33.0
    • torch (with CUDA support)
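
In Colab, the pinned versions can be installed in a single cell (torch with CUDA support comes preinstalled in the Colab runtime):

!pip install -q transformers==4.45.1 bitsandbytes==0.43.3 accelerate==0.33.0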

Notes

  • The quantized model is stored in /content/quantized_Llama-3.1-8B-Instruct in the Colab environment.
  • Due to Colab's ephemeral storage, consider pushing the model to the Hugging Face Hub (see the end of the Quantization Process section) or copying it to Google Drive (sketched after these notes) for persistence.
  • Access to the base model requires a Hugging Face token and approval from Meta AI.
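
A minimal sketch of the Google Drive option mentioned above, assuming the standard Colab drive helper and the save path from the quantization script:

from google.colab import drive
import shutil

# Mount Google Drive and copy the saved model directory into it.
drive.mount("/content/drive")
shutil.copytree(
    "/content/quantized_Llama-3.1-8B-Instruct",
    "/content/drive/MyDrive/quantized_Llama-3.1-8B-Instruct",
)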

License

This model inherits the license of the base model, meta-llama/Llama-3.1-8B-Instruct (Llama 3.1 Community License). Refer to the original model card: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

Acknowledgments

  • Created using Hugging Face Transformers and bitsandbytes for quantization.
  • Quantized in Google Colab with a T4 GPU on June 20, 2025.