---
library_name: transformers
tags:
- quantization
- fp4a16
- vllm
- multilingual
- text-generation
base_model: Sunbird/Sunflower-14B
model_type: llm
license: apache-2.0
---

# Sunbird/Sunflower-14B-FP4A16

## Model Overview

This is a quantized version of [Sunbird/Sunflower-14B](https://huggingface.co/Sunbird/Sunflower-14B) using the **NVFP4A16** quantization scheme, optimized for efficient inference while maintaining model quality.

🌻 Sunflower-14B is a multilingual language model developed by Sunbird AI for Ugandan languages. Built on the Qwen3-14B architecture, the model supports translation and text generation across 31 Ugandan languages plus English. It achieves the highest translation accuracy among evaluated models in 24 of 31 language pairs.

- Developed by: Sunbird AI
- Model type: Causal language model
- Languages: 31 Ugandan languages + English (see the base model card for the full list of language codes)

### Quantization Details

- **Quantization Method:** 4-bit floating-point weights with 16-bit activations
- **Base Model:** Sunbird/Sunflower-14B
- **Quantization Framework:** llmcompressor (vLLM)
- **Memory Efficiency:** ~75% reduction in model size
- **Performance:** Optimized for NVIDIA GPUs with high throughput

## Model Details

### Model Description

This quantized model maintains the capabilities of the base 14B-parameter model while offering significant improvements in memory usage and inference speed.

- **Developed by:** Sunbird AI
- **Model type:** Quantized Large Language Model
- **Language(s):** Multilingual (with a focus on East African languages)
- **License:** Apache 2.0
- **Quantized from model:** [Sunbird/Sunflower-14B](https://huggingface.co/Sunbird/Sunflower-14B)
- **Quantization scheme:** NVFP4A16

### Model Sources

- **Base Repository:** [Sunbird/Sunflower-14B](https://huggingface.co/Sunbird/Sunflower-14B)
- **Quantization Tool:** [llmcompressor](https://github.com/vllm-project/llm-compressor)

## Usage

### Installation

First, install the required dependencies:

```bash
pip install vllm llmcompressor
```

### Inference with vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM("Sunbird/Sunflower-14B-FP4A16", enforce_eager=True)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

# Method 1: Using the chat template (recommended)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to Luganda: Uganda is a landlocked country in East Africa."}
]

# Apply the chat template to format the messages
formatted_prompt = model.get_tokenizer().apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate a response
outputs = model.generate([formatted_prompt], sampling_params)
print(outputs[0].outputs[0].text)

# Method 2: Direct text generation
prompt = "Explain the importance of biodiversity:"
outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
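### Serving with the OpenAI-Compatible API

Beyond offline generation, the model can also be exposed through vLLM's OpenAI-compatible server and queried with any OpenAI-style client. The snippet below is a minimal sketch: it assumes the server was started locally with `vllm serve Sunbird/Sunflower-14B-FP4A16`, is listening on vLLM's default port 8000 with no API key configured, and that the `openai` Python package is installed.

```python
# Minimal sketch: query a locally running vLLM OpenAI-compatible server
# (started with, e.g., `vllm serve Sunbird/Sunflower-14B-FP4A16`)
from openai import OpenAI

# The API key is a placeholder; vLLM accepts any value unless --api-key is set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Sunbird/Sunflower-14B-FP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Translate to Luganda: Good morning, how are you?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The chat template is applied on the server side, so requests can be sent directly in standard OpenAI chat format.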
### Hardware Requirements

**Recommended:**

- GPU: NVIDIA GPU with Compute Capability 7.0+ (V100, A100, H100, or RTX 30xx/40xx series)
- VRAM: 12-16 GB (down from ~28 GB for FP16)
- System RAM: 32 GB+

**Minimum:**

- GPU: NVIDIA GPU with 8 GB VRAM
- System RAM: 16 GB+

## Performance

### Memory Efficiency

This quantized model provides significant memory savings compared to the original FP16 model:

- **Original Model Size (FP16):** ~28 GB
- **Quantized Model Size (FP4A16):** ~7 GB
- **Memory Reduction:** ~75% reduction in model size

### Inference Speed

Optimized for NVIDIA GPUs with high throughput. The quantized model typically shows:

- Faster token generation compared to FP16
- Lower latency to first token
- Higher throughput for batch inference

*Note: Actual performance may vary based on hardware, batch size, and sequence length.*

## Quantization Process

This model was quantized using the llmcompressor library from the vLLM project. Here's how you can reproduce the quantization:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantization recipe for NVFP4A16: 4-bit floating-point weights with
# 16-bit activations, applied to the model's Linear layers
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16")

# Apply quantization and save the compressed checkpoint
oneshot(
    model="Sunbird/Sunflower-14B",
    recipe=recipe,
    output_dir="Sunbird/Sunflower-14B-FP4A16",
)
```

## Limitations and Considerations

### Quantization Trade-offs

While quantization provides significant benefits in memory and speed, users should be aware of:

1. **Slight Quality Degradation:** Some tasks may show minor performance differences compared to the full-precision model
2. **Hardware Requirements:** Optimal performance requires compatible NVIDIA GPUs
3. **Framework Dependency:** Currently optimized for vLLM inference

### Recommended Use Cases

**Best suited for:**

- Production deployments requiring efficient inference
- Real-time applications needing low latency
- Scenarios with limited GPU memory
- Batch processing workloads

**Consider the full-precision model for:**

- Tasks requiring maximum accuracy
- Research experiments with fine-tuning
- Scenarios where memory is not a constraint

## Citation

If you use this model, please cite both the base model and the quantization work:

**BibTeX:**

```bibtex
@misc{Sunbird_Sunflower_14B_FP4A16,
  title={{Sunflower-14B FP4A16 Quantized Model}},
  author={{Sunbird AI}},
  year={{2025}},
  publisher={{HuggingFace}},
  howpublished={{\url{https://huggingface.co/Sunbird/Sunflower-14B-FP4A16}}}
}
```

**APA:**

Sunbird AI. (2025). *Sunflower-14B FP4A16 Quantized Model*. HuggingFace. https://huggingface.co/Sunbird/Sunflower-14B-FP4A16

## Model Card Contact

For questions or feedback about this quantized model:

- **Repository Issues:** [GitHub Issues](https://github.com/SunbirdAI)
- **Email:** info@sunbird.ai
- **Base Model:** [Sunbird/Sunflower-14B](https://huggingface.co/Sunbird/Sunflower-14B)

---

*Model card generated on 2025-10-09*