Llama-3.2-1B-Instruct-Nova-FP8

This model is a Nova-quantized FP8 version of meta-llama/Llama-3.2-1B-Instruct, optimized for high-throughput inference on H100 GPUs using FlashInfer.

Model Details

  • Model Type: llama
  • Architecture: LlamaForCausalLM
  • Use Case: Text
  • Quantization Method: Nova FP8 (E4M3; see the format note after this list)
  • Quantization Date: 2025-09-06
  • Compression Ratio: 2.00x
  • Quantized Layers: 112
  • Framework: FlashInfer-optimized for H100 Tensor Cores
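
For context, E4M3 packs each weight into a single byte (1 sign, 4 exponent, 3 mantissa bits), which is exactly where the 2.00x compression over 16-bit weights comes from. A minimal PyTorch check of the format's properties (a sketch assuming torch 2.1+, which ships the float8_e4m3fn dtype):

import torch

# E4M3 numeric range: 1 sign bit, 4 exponent bits, 3 mantissa bits
info = torch.finfo(torch.float8_e4m3fn)
print(info.max)  # 448.0, largest representable magnitude
print(info.eps)  # 0.125, spacing between 1.0 and the next value

# One FP8 byte per weight vs. two BF16 bytes gives the 2.00x ratio
print(torch.empty(0, dtype=torch.bfloat16).element_size())       # 2
print(torch.empty(0, dtype=torch.float8_e4m3fn).element_size())  # 1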

Model Specifications

Parameter           Value
Vocabulary Size     128,256
Hidden Size         2048
Intermediate Size   8192
Hidden Layers       16
Attention Heads     32
KV Heads            8
Max Position        131,072
Activation          silu
RoPE Theta          500000.0
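
These values determine the rest of the geometry: each head is 2048 / 32 = 64-dimensional, and 32 query heads sharing 8 KV heads gives 4-way grouped-query attention. A back-of-the-envelope parameter count from the table (a sketch that ignores norm weights and assumes tied input/output embeddings, as in the 1B source model):

# Values from the specification table above
vocab, hidden, inter, layers = 128_256, 2048, 8192, 16
heads, kv_heads = 32, 8

head_dim = hidden // heads        # 64
gqa_groups = heads // kv_heads    # 4 query heads per KV head

embed = vocab * hidden                                        # embeddings (tied with lm_head)
attn = hidden * (hidden + 2 * kv_heads * head_dim + hidden)   # q, k, v, o projections
mlp = 3 * hidden * inter                                      # gate, up, down projections
total = embed + layers * (attn + mlp)

print(f"{total / 1e9:.2f}B parameters")  # ~1.24B, matching the 1B model family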

Performance

Metric                 Value
Model Size Reduction   50.0%
Quantization Time      ~0.13 seconds
Memory Usage           3.44 GB
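
The 50.0% reduction follows directly from the dtype change: BF16 stores each weight in 2 bytes, FP8 in 1. A rough weight-only memory estimate under that assumption (the 3.44 GB figure above presumably also covers runtime buffers beyond the raw weights):

params = 1.24e9                   # parameter count estimated above
bf16_gb = params * 2 / 1024**3    # 16-bit baseline weights
fp8_gb = params * 1 / 1024**3     # FP8 weights
print(f"BF16 ~{bf16_gb:.2f} GiB, FP8 ~{fp8_gb:.2f} GiB, "
      f"reduction {1 - fp8_gb / bf16_gb:.0%}")  # 50%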

Validation Results

  • Test Prompt: "Hello, my name is"
  • Validation Status: ✅ Passed
  • Forward Pass: Completed successfully

Usage

This model requires the Nova inference engine with FlashInfer support:

# Install Nova (requires CUDA 12.8.1+ and H100 GPU)
pip install nova-inference flashinfer

# Load and use the model
from nova import load_model

model = load_model("remodlai/Llama-3.2-1B-Instruct-nova-fp8")
output = model.generate("Your prompt here")

Limitations

  • Requires H100 GPUs with CUDA 12.8.1+ (a preflight check is sketched after this list)
  • Only compatible with Nova inference engine
  • FlashInfer 0.2.14+ required
  • Not compatible with standard transformers library
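
Because these constraints are hard requirements, it is worth failing fast before loading any weights. A minimal preflight sketch in plain PyTorch (H100 reports CUDA compute capability 9.0; the CUDA version string depends on the local toolkit):

import torch

assert torch.cuda.is_available(), "No CUDA device found"

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), \
    f"Compute capability {major}.{minor} found; H100 (9.0) required"

print(torch.cuda.get_device_name())  # e.g. 'NVIDIA H100 80GB HBM3'
print(torch.version.cuda)            # should report 12.8+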

Technical Details

Quantization Process

  1. Model capabilities detected using Nova's AutoConfig analysis framework
  2. Weights quantized to FP8 format using FlashInfer's to_float8() function (see the sketch after this list)
  3. Scales stored as reciprocals for optimal FlashInfer bmm_fp8 performance
  4. Weights transposed to column-major format, as required by FlashInfer
  5. lm_head preserved in BF16 to protect output quality
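
Steps 2 through 4 can be sketched in plain PyTorch. This is an illustration only, not Nova's actual code: the to_float8 helper below is a stand-in for FlashInfer's, using a standard per-tensor scale against E4M3's 448.0 maximum:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def to_float8(w):
    # Per-tensor scale: map the largest weight magnitude onto FP8's max
    scale = w.abs().max().float() / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(2048, 8192, dtype=torch.bfloat16)  # a stand-in weight matrix
w_fp8, scale = to_float8(w)                        # step 2

inv_scale = scale.reciprocal()  # step 3: keep the reciprocal for bmm_fp8
w_fp8_col = w_fp8.t()           # step 4: transposed (column-major) view

# Step 5 is simply a skip: lm_head stays in BF16 and is never quantized.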

Configuration

The model config includes Nova-specific metadata:

  • nova_quant: true
  • nova_quant_version: "1.0"
  • quantization_config: Contains format, optimization flags, and source model reference
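
These flags live in the repository's config.json, so they can be inspected without the Nova runtime. A small sketch using huggingface_hub (key names as listed above):

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "remodlai/Llama-3.2-1B-Instruct-Nova-FP8", filename="config.json"
)
with open(path) as f:
    config = json.load(f)

print(config.get("nova_quant"))           # True
print(config.get("nova_quant_version"))   # "1.0"
print(config.get("quantization_config"))  # format, flags, source model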

Detected Configuration

{
  "model_type": "llama",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "use_case": "text",
  "multimodal": false
}

Citation

If you use this model, please cite:

@software{nova2025,
  title = {Nova FP8 Quantization},
  author = {Remodl AI},
  year = {2025},
  url = {https://github.com/remodlai/nova}
}

License

This quantized model inherits the license of the source model. Please refer to the original meta-llama/Llama-3.2-1B-Instruct model card for detailed license information.
