Llama-3.2-1B-Instruct-Nova-FP8

This model is a Nova-quantized FP8 version of meta-llama/Llama-3.2-1B-Instruct, optimized for high-throughput inference on H100 GPUs using FlashInfer.

Model Details

  • Model Type: llama
  • Architecture: LlamaForCausalLM
  • Use Case: Text
  • Quantization Method: Nova FP8 (E4M3; see the format note after this list)
  • Quantization Date: 2025-09-06
  • Compression Ratio: 2.00x
  • Quantized Layers: 112
  • Framework: FlashInfer-optimized for H100 Tensor Cores
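
For context, E4M3 packs each weight into a single byte (1 sign, 4 exponent, 3 mantissa bits), which is exactly where the 2.00x compression over 16-bit weights comes from. A minimal PyTorch check of the format's properties (a sketch assuming torch 2.1+, which ships the float8_e4m3fn dtype):

import torch

# E4M3 numeric range: 1 sign bit, 4 exponent bits, 3 mantissa bits
info = torch.finfo(torch.float8_e4m3fn)
print(info.max)  # 448.0, largest representable magnitude
print(info.eps)  # 0.125, spacing between 1.0 and the next value

# One FP8 byte per weight vs. two BF16 bytes gives the 2.00x ratio
print(torch.empty(0, dtype=torch.bfloat16).element_size())       # 2
print(torch.empty(0, dtype=torch.float8_e4m3fn).element_size())  # 1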

Model Specifications

Parameter           Value
Vocabulary Size     128,256
Hidden Size         2048
Intermediate Size   8192
Hidden Layers       16
Attention Heads     32
KV Heads            8
Max Position        131,072
Activation          silu
RoPE Theta          500000.0
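
These values determine the rest of the geometry: each head is 2048 / 32 = 64-dimensional, and 32 query heads sharing 8 KV heads gives 4-way grouped-query attention. A back-of-the-envelope parameter count from the table (a sketch that ignores norm weights and assumes tied input/output embeddings, as in the 1B source model):

# Values from the specification table above
vocab, hidden, inter, layers = 128_256, 2048, 8192, 16
heads, kv_heads = 32, 8

head_dim = hidden // heads        # 64
gqa_groups = heads // kv_heads    # 4 query heads per KV head

embed = vocab * hidden                                        # embeddings (tied with lm_head)
attn = hidden * (hidden + 2 * kv_heads * head_dim + hidden)   # q, k, v, o projections
mlp = 3 * hidden * inter                                      # gate, up, down projections
total = embed + layers * (attn + mlp)

print(f"{total / 1e9:.2f}B parameters")  # ~1.24B, matching the 1B model family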

Performance

Metric                 Value
Model Size Reduction   50.0%
Quantization Time      ~0.13 seconds
Memory Usage           3.44 GB
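
The 50.0% reduction follows directly from the dtype change: BF16 stores each weight in 2 bytes, FP8 in 1. A rough weight-only memory estimate under that assumption (the 3.44 GB figure above presumably also covers runtime buffers beyond the raw weights):

params = 1.24e9                   # parameter count estimated above
bf16_gb = params * 2 / 1024**3    # 16-bit baseline weights
fp8_gb = params * 1 / 1024**3     # FP8 weights
print(f"BF16 ~{bf16_gb:.2f} GiB, FP8 ~{fp8_gb:.2f} GiB, "
      f"reduction {1 - fp8_gb / bf16_gb:.0%}")  # 50%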

Validation Results

  • Test Prompt: "Hello, my name is"
  • Validation Status: ✅ Passed
  • Forward Pass: Completed successfully

Usage

This model requires the Nova inference engine with FlashInfer support:

# Install Nova (requires CUDA 12.8.1+ and H100 GPU)
pip install nova-inference flashinfer

# Load and use the model
from nova import load_model

model = load_model("remodlai/Llama-3.2-1B-Instruct-nova-fp8")
output = model.generate("Your prompt here")

Limitations

  • Requires H100 GPUs with CUDA 12.8.1+ (a preflight check is sketched after this list)
  • Only compatible with Nova inference engine
  • FlashInfer 0.2.14+ required
  • Not compatible with standard transformers library
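
Because these constraints are hard requirements, it is worth failing fast before loading any weights. A minimal preflight sketch in plain PyTorch (H100 reports CUDA compute capability 9.0; the CUDA version string depends on the local toolkit):

import torch

assert torch.cuda.is_available(), "No CUDA device found"

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), \
    f"Compute capability {major}.{minor} found; H100 (9.0) required"

print(torch.cuda.get_device_name())  # e.g. 'NVIDIA H100 80GB HBM3'
print(torch.version.cuda)            # should report 12.8+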

Technical Details

Quantization Process

  1. Model capabilities detected using Nova's AutoConfig analysis framework
  2. Weights quantized to FP8 format using FlashInfer's to_float8() function (see the sketch after this list)
  3. Scales stored as reciprocals for optimal FlashInfer bmm_fp8 performance
  4. Weights transposed to column-major format, as required by FlashInfer
  5. lm_head preserved in BF16 to protect output quality
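
Steps 2 through 4 can be sketched in plain PyTorch. This is an illustration only, not Nova's actual code: the to_float8 helper below is a stand-in for FlashInfer's, using a standard per-tensor scale against E4M3's 448.0 maximum:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def to_float8(w):
    # Per-tensor scale: map the largest weight magnitude onto FP8's max
    scale = w.abs().max().float() / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(2048, 8192, dtype=torch.bfloat16)  # a stand-in weight matrix
w_fp8, scale = to_float8(w)                        # step 2

inv_scale = scale.reciprocal()  # step 3: keep the reciprocal for bmm_fp8
w_fp8_col = w_fp8.t()           # step 4: transposed (column-major) view

# Step 5 is simply a skip: lm_head stays in BF16 and is never quantized.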

Configuration

The model config includes Nova-specific metadata:

  • nova_quant: true
  • nova_quant_version: "1.0"
  • quantization_config: Contains format, optimization flags, and source model reference
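
These flags live in the repository's config.json, so they can be inspected without the Nova runtime. A small sketch using huggingface_hub (key names as listed above):

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "remodlai/Llama-3.2-1B-Instruct-Nova-FP8", filename="config.json"
)
with open(path) as f:
    config = json.load(f)

print(config.get("nova_quant"))           # True
print(config.get("nova_quant_version"))   # "1.0"
print(config.get("quantization_config"))  # format, flags, source model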

Detected Configuration

{
  "model_type": "llama",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "use_case": "text",
  "multimodal": false
}

Citation

If you use this model, please cite:

@software{nova2025,
  title = {Nova FP8 Quantization},
  author = {Remodl AI},
  year = {2025},
  url = {https://github.com/remodlai/nova}
}

License

This quantized model inherits the license of the source model. Please refer to the original meta-llama/Llama-3.2-1B-Instruct model card for detailed license information.
