# Llama-3.2-1B-Instruct-Nova-FP8
This model is a Nova-quantized FP8 version of meta-llama/Llama-3.2-1B-Instruct, optimized for high-throughput inference on H100 GPUs using FlashInfer.
## Model Details
- Model Type: llama
- Architecture: LlamaForCausalLM
- Use Case: Text
- Quantization Method: Nova FP8 (E4M3)
- Quantization Date: 2025-09-06
- Compression Ratio: 2.00x
- Quantized Layers: 112
- Framework: FlashInfer-optimized for H100 Tensor Cores
## Model Specifications

| Parameter | Value |
|---|---|
| Vocabulary Size | 128,256 |
| Hidden Size | 2048 |
| Intermediate Size | 8192 |
| Hidden Layers | 16 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Max Position Embeddings | 131,072 |
| Activation | SiLU |
| RoPE Theta | 500,000.0 |
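
The table above implies a few derived quantities. The following is a back-of-envelope check, not how the model was measured: the parameter-count formula (tied embeddings, no biases, SwiGLU MLP) is an approximation introduced here.

```python
# Derived quantities from the specification table above. All inputs are copied
# from the table; the parameter-count formula is an approximation, not an
# official figure.
hidden, intermediate, layers = 2048, 8192, 16
heads, kv_heads, vocab = 32, 8, 128_256

head_dim = hidden // heads       # 64
gqa_group = heads // kv_heads    # 4 query heads share each KV head

embed = vocab * hidden                                          # token embeddings (tied with lm_head)
attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim   # q/o plus k/v projections per layer
mlp = 3 * hidden * intermediate                                 # gate, up, down projections per layer
total = embed + layers * (attn + mlp)

print(f"head_dim={head_dim}, GQA group={gqa_group}, ~{total / 1e9:.2f}B parameters")
```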
## Performance

| Metric | Value |
|---|---|
| Model Size Reduction | 50.0% |
| Quantization Time | ~0.13 s |
| Memory Usage | 3.44 GB |
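
The 2.00x compression and 50% size-reduction figures follow from FP8 using 1 byte per weight versus 2 bytes for BF16. The sketch below is a rough check under that assumption; the 1.24B parameter figure is the approximation from the previous section, not a reported number, and the 3.44 GB memory figure includes more than raw weights.

```python
# Rough check of the 2.00x compression / 50% reduction figures.
params = 1.24e9                 # approximate parameter count (assumption)
bf16_gb = params * 2 / 1e9      # 2 bytes per weight in BF16
fp8_gb = params * 1 / 1e9       # 1 byte per weight in FP8
print(f"BF16 ~{bf16_gb:.2f} GB, FP8 ~{fp8_gb:.2f} GB, ratio ~{bf16_gb / fp8_gb:.2f}x")
```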
## Validation Results

- Test Prompt: "Hello, my name is"
- Validation Status: ✅ Passed
- Forward Pass: Completed successfully
## Usage

This model requires the Nova inference engine with FlashInfer support:

```bash
# Install Nova (requires CUDA 12.8.1+ and an H100 GPU)
pip install nova-inference flashinfer
```

```python
# Load and use the model
from nova import load_model

model = load_model("remodlai/Llama-3.2-1B-Instruct-nova-fp8")
output = model.generate("Your prompt here")
```
## Limitations
- Requires H100 GPUs with CUDA 12.8.1+
- Only compatible with Nova inference engine
- FlashInfer 0.2.14+ required
- Not compatible with standard transformers library
## Technical Details
### Quantization Process

- Model capabilities detected using Nova's AutoConfig analysis framework
- Weights quantized to FP8 format using FlashInfer's `to_float8()` function (see the sketch below)
- Scales stored as reciprocals for optimal FlashInfer `bmm_fp8` performance
- Weights transposed to column-major format, as required by FlashInfer
- The `lm_head` layer preserved in BF16 for output quality
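
The card names FlashInfer's `to_float8()` and `bmm_fp8` for these steps; the sketch below is an illustrative equivalent in plain PyTorch, not Nova's actual code. Per-tensor scale granularity and the helper name `quantize_fp8_e4m3` are assumptions made for illustration.

```python
import torch

def quantize_fp8_e4m3(weight: torch.Tensor):
    """Per-tensor FP8 (E4M3) quantization with the scale stored as a reciprocal."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max              # 448.0 for E4M3
    scale = weight.abs().amax().float().clamp(min=1e-12) / fp8_max
    q = (weight.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Store in column-major layout (as the notes above say FlashInfer expects)
    # and return the reciprocal of the scale so kernels multiply instead of divide.
    return q.t().contiguous().t(), scale.reciprocal()

# Quick round-trip check on a random weight matrix.
w = torch.randn(2048, 8192, dtype=torch.bfloat16)
w_fp8, inv_scale = quantize_fp8_e4m3(w)
dequant = w_fp8.float() / inv_scale
print(w_fp8.dtype, float((dequant - w.float()).abs().max()))
```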
### Configuration

The model config includes Nova-specific metadata (inspected in the snippet below):

- `nova_quant`: true
- `nova_quant_version`: "1.0"
- `quantization_config`: contains the format, optimization flags, and a reference to the source model
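
One hypothetical way to confirm these fields is to read the repo's `config.json` directly with the standard library; only the three field names listed above are taken from the card, and no other keys are assumed.

```python
import json
from pathlib import Path

# config.json downloaded from the model repository
cfg = json.loads(Path("config.json").read_text())
print(cfg.get("nova_quant"))           # expected: True
print(cfg.get("nova_quant_version"))   # expected: "1.0"
print(cfg.get("quantization_config"))  # format, optimization flags, source model reference
```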
### Detected Configuration

```json
{
  "model_type": "llama",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "use_case": "text",
  "multimodal": false
}
```
## Citation

If you use this model, please cite:

```bibtex
@software{nova2025,
  title  = {Nova FP8 Quantization},
  author = {Remodl AI},
  year   = {2025},
  url    = {https://github.com/remodlai/nova}
}
```
## License

This quantized model is distributed under the same license as the source model, meta-llama/Llama-3.2-1B-Instruct. Please refer to the original model card for detailed license information.