Llama-3.3-70B-Instruct-FP8

This is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model, stored in the compressed-tensors format.

Model Details

  • Base Model: meta-llama/Llama-3.3-70B-Instruct
  • Quantization: FP8 (8-bit float) using compressed-tensors
  • Model Size: ~70.5B parameters
  • Quantized Size: ~72.7GB (roughly half the size of the original BF16 weights, which are about 140GB)
  • Architecture: LlamaForCausalLM
  • Context Length: 131,072 tokens
  • Quantization Method: compressed-tensors v0.12.2

Quantization Details

This model uses FP8 quantization with the following configuration:

  • Format: float-quantized
  • Bits: 8-bit float
  • Strategy: tensor-level quantization
  • Observer: minmax
  • Symmetric: true
  • Target Layers: Linear layers (excluding lm_head)
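
The card does not say which tool produced this checkpoint, but configurations like the one above (static per-tensor FP8 on Linear layers, minmax observer, symmetric scales, lm_head excluded) are commonly generated with the llm-compressor library, which writes compressed-tensors checkpoints. The following is a minimal illustrative sketch only, not the exact recipe used for this model, and import paths may differ across llm-compressor versions:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

base_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Static per-tensor FP8 for all Linear layers, keeping lm_head in full precision
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# One-shot (data-free) quantization, then save in compressed-tensors format
oneshot(model=model, recipe=recipe)
model.save_pretrained("Llama-3.3-70B-Instruct-FP8", save_compressed=True)
tokenizer.save_pretrained("Llama-3.3-70B-Instruct-FP8")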

Usage

Using Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer (compressed-tensors must be installed so that
# transformers can read this checkpoint's FP8 quantization config)
model_name = "your-username/Llama-3.3-70B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example usage
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
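
Because this is an instruction-tuned model, it generally responds better when prompts are wrapped in the Llama 3.3 chat template instead of being passed as raw text. A minimal example using the tokenizer's built-in template (reusing the model and tokenizer loaded above):

# Chat-style prompting via the tokenizer's chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))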

Using with Compressed-Tensors

No separate loading API is needed: with the compressed-tensors package installed, transformers detects the quantization config stored in this checkpoint and loads the FP8 weights through the standard from_pretrained call.

from transformers import AutoTokenizer, AutoModelForCausalLM

# The compressed-tensors format is detected from the checkpoint's config;
# no extra arguments are required to load the quantized weights
model_name = "your-username/Llama-3.3-70B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
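
For high-throughput serving, compressed-tensors FP8 checkpoints can also be run with vLLM, which supports this format natively. A minimal sketch, assuming a recent vLLM release and enough GPU memory for the ~72.7GB of weights (adjust tensor_parallel_size to your GPU count):

from vllm import LLM, SamplingParams

llm = LLM(model="your-username/Llama-3.3-70B-Instruct-FP8", tensor_parallel_size=2)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain quantum computing in simple terms."], sampling)
print(outputs[0].outputs[0].text)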

Performance

  • Memory Usage: the FP8 weights need roughly half the memory of the original BF16 weights (see the estimate below)
  • Speed: inference throughput can improve on GPUs with native FP8 support (e.g., NVIDIA Hopper or Ada Lovelace)
  • Quality: FP8 quantization typically introduces only minor degradation relative to the original model
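
As a back-of-the-envelope check on the memory numbers (weights only, ignoring activations and the KV cache):

# Rough weight-memory estimate for ~70.5B parameters
params = 70.5e9
print(f"FP8  (1 byte/param):  ~{params * 1 / 1e9:.0f} GB")  # ~70 GB, close to the ~72.7GB checkpoint
print(f"BF16 (2 bytes/param): ~{params * 2 / 1e9:.0f} GB")  # ~141 GB for the original weights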

Requirements

  • transformers >= 4.55.4
  • compressed-tensors >= 0.12.2
  • torch
  • accelerate

Model Card

This model is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model. The original model is designed for instruction-following tasks and general language understanding. FP8 quantization roughly halves the weight memory footprint while aiming to preserve the quality of the original model.

Training Data

This model inherits the training data from the base Llama 3.3 70B Instruct model.

Intended Use

This model is intended for:

  • Text generation
  • Instruction following
  • Code generation
  • General language understanding tasks

Limitations

  • Quantization may introduce minor quality degradation
  • Requires compatible hardware for optimal performance
  • Large model size still requires significant computational resources

Citation

@misc{llama33instruct,
  title={Llama 3.3 70B Instruct},
  author={Meta AI},
  year={2024},
  url={https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct}
}

License

This model is released under the Llama 3.3 Community License. Please review the license terms before use.
