Llama-3.3-70B-Instruct-4bit (LeanQuant)

This is a 4-bit version of embraceableAI/e2-llama-v3p3-70B-Merged-v1, quantized with LeanQuant to reduce memory usage and speed up inference.
It is suitable for instruction following, dialogue, and general-purpose generation on memory-constrained hardware.

🧠 Model Details

  • Base model: EmbraceableAI LLaMA-3.3 70B merged checkpoint
  • Quantization: 4-bit via LeanQuant
  • File: Llama-3.3-70B-Instruct-4bit.safetensors (see the download sketch after this list)
  • Size: ~36 GB
  • Format: safetensors
  • Device support: Multi-GPU via device_map="auto"
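
The weights file listed above can be fetched directly from the Hugging Face Hub. Below is a minimal download sketch using huggingface_hub; the repository id shown is a placeholder, not a confirmed location for this checkpoint.

from huggingface_hub import hf_hub_download

# NOTE: "your-namespace/Llama-3.3-70B-Instruct-4bit-LeanQuant" is a placeholder repo id
weights_path = hf_hub_download(
    repo_id="your-namespace/Llama-3.3-70B-Instruct-4bit-LeanQuant",
    filename="Llama-3.3-70B-Instruct-4bit.safetensors",
)
# Pass weights_path as the second argument to LeanQuantModelForCausalLM.from_pretrained below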

🧪 Intended Use

  • Instruction following (chat-style)
  • Dialogue / assistant-style conversation
  • General-purpose text generation on memory-constrained hardware

🚀 Usage Example

import torch
from leanquant import LeanQuantModelForCausalLM
from transformers import AutoTokenizer

# Load the 4-bit LeanQuant model and its tokenizer
base_model_name = "embraceableAI/e2-llama-v3p3-70B-Merged-v1"
model = LeanQuantModelForCausalLM.from_pretrained(
    base_model_name,
    "./model.safetensors",  # local path to the downloaded Llama-3.3-70B-Instruct-4bit.safetensors
    bits=4,
    device_map="auto",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Build a chat prompt and tokenize it with the model's chat template
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What is quantization for deep learning models?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Run generation and decode the generated tokens
with torch.no_grad():
    output = model.generate(**inputs, do_sample=True, max_new_tokens=256)

generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)
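
The decode above returns the full sequence, including the prompt and special tokens. If you only want the newly generated reply, a small sketch using standard tokenizer/tensor slicing (not LeanQuant-specific) is:

# Decode only the tokens generated after the prompt
prompt_length = inputs["input_ids"].shape[-1]
reply = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(reply)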

> 📘 **Try it in Colab for quantization**:
> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1RGfgqQm4XVmEWQVph5-4D3xmYGbAwEwW)