---
library_name: transformers
tags:
  - quantization
  - fp8
  - vllm
  - multilingual
  - text-generation
base_model: Sunbird/Sunflower-32B
model_type: llm
license: apache-2.0
---

# Sunbird/Sunflower-32B-FP8

## Model Overview

This is a quantized version of Sunbird/Sunflower-32B using the FP8_DYNAMIC quantization scheme. This model has been optimized for efficient inference while maintaining model quality.

🌻 Sunflower-32B is a multilingual language model developed by Sunbird AI for Ugandan languages. Built on the Qwen 3-32B architecture, the model supports translation and text generation across 31 Ugandan languages plus English. The model achieves the highest translation accuracy among evaluated models in 24 of 31 language pairs.

Developed by: Sunbird AI
Model type: Causal language model
Languages: 31 Ugandan languages + English

## Quantization Details

  • Quantization Method: 8-bit floating point (FP8) weight quantization with dynamic per-token activation scaling (see the conceptual sketch below)
  • Base Model: Sunbird/Sunflower-32B
  • Quantization Framework: llmcompressor (vLLM)
  • Memory Efficiency: ~50% reduction in model size compared to FP16
  • Performance: faster inference with minimal accuracy loss
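
To make the scheme concrete, here is a minimal, self-contained sketch of what dynamic FP8 quantization does to an activation tensor. This is purely illustrative, not the llmcompressor implementation; it assumes PyTorch 2.1+ for `torch.float8_e4m3fn`. Each token (row) gets its own scale computed on the fly, while weights are quantized once ahead of time.

```python
import torch

# FP8 E4M3 represents magnitudes up to 448, so each row is scaled into that range.
FP8_MAX = 448.0

def fp8_dynamic_quantize(x: torch.Tensor):
    """Per-token (row-wise) dynamic FP8 quantization: each token gets its own scale."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

activations = torch.randn(4, 16)  # 4 tokens, hidden size 16
quantized, scales = fp8_dynamic_quantize(activations)
error = (fp8_dequantize(quantized, scales) - activations).abs().max().item()
print(f"max round-trip error: {error:.4f}")
```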

## Model Details

### Model Description

This quantized model maintains the capabilities of the base 32B parameter model while offering significant improvements in memory usage and inference speed.

  • Developed by: Sunbird AI
  • Model type: Quantized Large Language Model
  • Language(s): Multilingual (with focus on East African languages)
  • License: Apache 2.0
  • Quantized from model: Sunbird/Sunflower-32B
  • Quantization scheme: FP8_DYNAMIC

### Model Sources

  • Repository: https://huggingface.co/Sunbird/Sunflower-32B-FP8
  • Base model: https://huggingface.co/Sunbird/Sunflower-32B

## Usage

### Installation

First, install the required dependencies:

```bash
pip install vllm llmcompressor
```
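
Optionally, confirm that both packages are importable and check their versions (a quick sanity check, not required):

```python
from importlib.metadata import version

# Print the installed versions of the two dependencies.
for pkg in ("vllm", "llmcompressor"):
    print(pkg, version(pkg))
```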

### Inference with vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM("Sunbird/Sunflower-32B-FP8", enforce_eager=True)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

# Method 1: Using the chat template (recommended)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to Luganda: Uganda is a landlocked country in East Africa."}
]

# Apply the chat template to format the messages
formatted_prompt = model.get_tokenizer().apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate a response
outputs = model.generate([formatted_prompt], sampling_params)
print(outputs[0].outputs[0].text)

# Method 2: Direct text generation
prompt = "Explain the importance of biodiversity:"
outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
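
For serving, the same checkpoint can also be exposed through vLLM's OpenAI-compatible server (for example, `vllm serve Sunbird/Sunflower-32B-FP8`) and queried with any OpenAI client. A minimal client sketch, assuming the server is running locally on the default port 8000:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (an API key is required by the client but unused).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Sunbird/Sunflower-32B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Translate to Luganda: Good morning, how are you?"},
    ],
    temperature=0.7,
    max_tokens=200,
)
print(response.choices[0].message.content)
```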

## Hardware Requirements

Recommended:

  • GPU: NVIDIA GPU with Compute Capability 8.9+ for native FP8 compute (e.g., H100, L40S, RTX 40xx series); Ampere GPUs (Compute Capability 8.0+, e.g., A100, RTX 30xx) run the checkpoint through vLLM's FP8 weight-only fallback
  • VRAM: 40GB+ for single-GPU serving (the FP8 weights alone occupy roughly 32GB, down from ~64GB in FP16)
  • System RAM: 32GB+

Minimum:

  • GPU: NVIDIA GPU(s) with at least ~35GB of total VRAM (e.g., two 24GB GPUs with tensor parallelism, as sketched below)
  • System RAM: 16GB+
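
If no single GPU has enough memory, the model can be sharded across several GPUs with vLLM's tensor parallelism. A minimal sketch, assuming two GPUs are visible:

```python
from vllm import LLM, SamplingParams

# Shard the FP8 weights across two GPUs; set tensor_parallel_size to match your hardware.
model = LLM(
    "Sunbird/Sunflower-32B-FP8",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,  # fraction of each GPU's memory vLLM may allocate
)

outputs = model.generate(
    ["Translate to Luganda: The weather is nice today."],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```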

## Performance

### Memory Efficiency

This quantized model provides significant memory savings compared to the original FP16 model (see the back-of-the-envelope arithmetic below):

  • Original Model Size (FP16): ~64GB
  • Quantized Model Size (FP8): ~32GB
  • Memory Reduction: ~50% reduction in model size
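
A back-of-the-envelope check of those figures from the parameter count alone (illustrative only; it ignores embeddings, quantization scales, and runtime overhead such as the KV cache):

```python
# Approximate weight memory for a ~32B-parameter model.
params = 32e9

bytes_fp16 = params * 2  # 2 bytes per parameter
bytes_fp8 = params * 1   # 1 byte per parameter

to_gb = lambda n: n / 1e9
print(f"FP16: ~{to_gb(bytes_fp16):.0f} GB, FP8: ~{to_gb(bytes_fp8):.0f} GB, "
      f"reduction: {1 - bytes_fp8 / bytes_fp16:.0%}")
# -> FP16: ~64 GB, FP8: ~32 GB, reduction: 50%
```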

### Inference Speed

Compared to running the base model in FP16, the quantized model typically shows:

  • Faster token generation compared to FP16
  • Lower latency for first token
  • Higher throughput for batch inference

Note: Actual performance may vary based on hardware, batch size, and sequence length.
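
To measure throughput on your own hardware, a simple (unscientific) check might look like the sketch below; it reuses the `model` and `sampling_params` objects from the usage example above, and the batch size and prompt are arbitrary:

```python
import time

prompts = ["Translate to Luganda: How are you today?"] * 32  # small synthetic batch

start = time.perf_counter()
outputs = model.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s ({generated / elapsed:.1f} tokens/s)")
```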

## Quantization Process

This model was quantized using the llmcompressor library from vLLM. Here's how you can reproduce the quantization:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantization recipe for FP8_DYNAMIC: FP8 weights with dynamic per-token activation scales.
# Linear layers are quantized; the lm_head is typically kept in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply quantization
oneshot(
    model="Sunbird/Sunflower-32B",
    recipe=recipe,
    output_dir="Sunbird/Sunflower-32B-FP8",
)
```
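
After the export finishes, you can spot-check that the quantization metadata was written to the saved checkpoint. A quick sanity check; the `quantization_config` entry comes from the compressed-tensors format that llmcompressor writes into `config.json`, and the path assumes the `output_dir` above:

```python
import json

# Inspect the quantization metadata in the exported config.json.
with open("Sunbird/Sunflower-32B-FP8/config.json") as f:
    config = json.load(f)

print(json.dumps(config.get("quantization_config", {}), indent=2))
```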

## Limitations and Considerations

### Quantization Trade-offs

While quantization provides significant benefits in memory and speed, users should be aware of:

  1. Slight Quality Degradation: Some tasks may show minor performance differences compared to the full-precision model (a simple spot check is sketched after this list)
  2. Hardware Requirements: Optimal performance requires compatible NVIDIA GPUs
  3. Framework Dependency: Currently optimized for vLLM inference
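
A practical way to gauge the first point for your own workload is a small side-by-side spot check: run the same prompts through this FP8 model and through the full-precision Sunbird/Sunflower-32B, then compare the outputs. A minimal sketch (the prompts are illustrative; run the second model in a separate process if both do not fit in memory at once):

```python
from vllm import LLM, SamplingParams

prompts = [
    "Translate to Luganda: The hospital opens at eight in the morning.",
    "Translate to Luganda: Please bring your identification documents.",
]
params = SamplingParams(temperature=0.0, max_tokens=200)  # greedy decoding for a like-for-like comparison

# Generate with the FP8 model; repeat the same loop with "Sunbird/Sunflower-32B" and compare.
fp8_model = LLM("Sunbird/Sunflower-32B-FP8")
for prompt, output in zip(prompts, fp8_model.generate(prompts, params)):
    print(prompt, "\n  ->", output.outputs[0].text.strip())
```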

### Recommended Use Cases

Best suited for:

  • Production deployments requiring efficient inference
  • Real-time applications needing low latency
  • Scenarios with limited GPU memory
  • Batch processing workloads (see the batching sketch below)

Consider full-precision model for:

  • Tasks requiring maximum accuracy
  • Research experiments with fine-tuning
  • Scenarios where memory is not a constraint
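
For the batch-processing use case above, vLLM batches automatically when you pass a list of prompts, and continuous batching keeps the GPU busy across requests. A small sketch reusing the chat-template pattern from the usage section (the sentences are illustrative):

```python
from vllm import LLM, SamplingParams

model = LLM("Sunbird/Sunflower-32B-FP8")
tokenizer = model.get_tokenizer()
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=300)

sentences = [
    "The clinic will be closed on Friday.",
    "Clean water is essential for good health.",
    "The meeting starts at ten o'clock.",
]

# Format every request with the chat template, then submit the whole batch at once.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": f"Translate to Luganda: {text}"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for text in sentences
]

for sentence, output in zip(sentences, model.generate(prompts, params)):
    print(sentence, "->", output.outputs[0].text.strip())
```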

## Citation

If you use this model, please cite both the base model and the quantization work:

BibTeX:

```bibtex
@misc{Sunbird_Sunflower_32B_FP8,
    title        = {Sunflower-32B FP8 Quantized Model},
    author       = {Sunbird AI},
    year         = {2025},
    publisher    = {Hugging Face},
    howpublished = {\url{https://huggingface.co/Sunbird/Sunflower-32B-FP8}}
}
```

APA:

Sunbird AI. (2025). Sunflower-32B FP8 Quantized Model. HuggingFace. https://huggingface.co/Sunbird/Sunflower-32B-FP8

## Model Card Contact

For questions or feedback about this quantized model, please open a discussion on the model's Hugging Face page or contact Sunbird AI.


Model card generated on 2025-10-09