πŸ¦„ Gemma 3 27B NPU+iGPU Quantized

πŸš€ Advanced NPU+iGPU Implementation

This quantized Gemma 3 27B model demonstrates advanced AI hardware acceleration: it runs on AMD Ryzen AI hardware, splitting work between the NPU Phoenix and the AMD Radeon 780M iGPU.

βœ… Production Status

  • Status: βœ… PRODUCTION READY
  • Server: Operational OpenAI v1 API server
  • Hardware: Real NPU Phoenix + AMD Radeon 780M
  • Size: 26GB quantized (74% reduction from 102GB)
  • Format: Safetensors layer-by-layer streaming
  • API: OpenAI v1 compatible

🎯 Quick Start

Using with Unicorn Execution Engine

# Clone the framework
git clone https://github.com/magicunicorn/unicorn-execution-engine.git
cd unicorn-execution-engine

# Download this model
huggingface-cli download magicunicorn/gemma-3-27b-npu-quantized

# Start production server
source activate-uc1-ai-py311.sh
python real_2025_gemma27b_server.py

# Server runs on http://localhost:8009
# Model: "gemma-3-27b-it-npu-igpu-real"
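
With the server up, any OpenAI v1 client can talk to it. Below is a minimal Python sketch against the standard /v1/chat/completions route; it assumes the server implements the usual OpenAI request schema, so adjust the payload if your build differs.

# query_server.py - minimal OpenAI v1 chat request (sketch; assumes standard schema)
import requests

resp = requests.post(
    "http://localhost:8009/v1/chat/completions",
    json={
        "model": "gemma-3-27b-it-npu-igpu-real",
        "messages": [{"role": "user", "content": "Hello from the NPU!"}],
        "max_tokens": 128,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])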

Using with OpenWebUI

# Add to OpenWebUI
URL: http://localhost:8009
Model: gemma-3-27b-it-npu-igpu-real
API: OpenAI v1 Compatible

πŸ”§ Hardware Requirements

Minimum Requirements

  • NPU: AMD Ryzen AI NPU Phoenix (16 TOPS)
  • iGPU: AMD Radeon 780M (RDNA3 architecture)
  • Memory: 32GB+ DDR5 RAM (96GB recommended)
  • Storage: 30GB+ for model files
  • OS: Ubuntu 25.04+ with Linux 6.14+ (HMA support)

Software Requirements

  • Unicorn Execution Engine: Latest version
  • MLIR-AIE2: Included in framework
  • Vulkan Drivers: Latest AMD drivers
  • XRT Runtime: /opt/xilinx/xrt

🎯 Performance

Benchmark Results

  • Hardware: Real NPU + iGPU acceleration
  • Attention: NPU Phoenix (16 TOPS)
  • FFN: AMD Radeon 780M (200+ GFLOPS)
  • Memory: Layer-by-layer streaming
  • Quality: Full 27.4B-parameter model preserved

Technical Specifications

  • Parameters: 27.4B (quantized)
  • Precision: INT4/INT8 optimized for NPU+iGPU
  • Context Length: 8192 tokens
  • Architecture: Gemma 3 with grouped-query attention (sketched after this list)
  • Quantization: Custom NPU+iGPU aware quantization
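
Grouped-query attention shares each key/value head across a group of query heads, which shrinks the KV cache and helps the attention stage fit on the NPU. A minimal NumPy sketch of the mechanism (head counts and sizes here are illustrative, not Gemma 3's actual configuration):

# gqa_sketch.py - grouped-query attention with shared K/V heads (illustrative)
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 64, 16   # hypothetical sizes
group = n_q_heads // n_kv_heads                     # query heads per K/V head

q = np.random.randn(n_q_heads, seq, d_head)
k = np.random.randn(n_kv_heads, seq, d_head)
v = np.random.randn(n_kv_heads, seq, d_head)

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group                                 # which shared K/V head
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over keys
    out[h] = weights @ v[kv]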

πŸ“š Technical Details

Quantization Strategy

  • NPU Layers: INT8 symmetric quantization
  • iGPU Layers: INT4 grouped quantization (both schemes sketched after this list)
  • Memory Optimized: Layer-by-layer streaming
  • Zero CPU Fallback: Pure hardware acceleration
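
For intuition, here is a minimal NumPy sketch of the two schemes named above: per-tensor symmetric INT8 (zero-point fixed at 0) and grouped INT4 (one scale per weight group). This illustrates the general techniques, not the engine's actual quantizer:

# quant_sketch.py - illustrative INT8-symmetric and INT4-grouped quantization
import numpy as np

def quant_int8_symmetric(w):
    # Symmetric INT8: one scale per tensor, zero-point fixed at 0.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                       # dequantize with q * scale

def quant_int4_grouped(w, group_size=128):
    # Grouped INT4: one scale per group of weights along the last axis.
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit range in int8 container
    return q, scale

w = np.random.randn(4096, 4096).astype(np.float32)
q8, s8 = quant_int8_symmetric(w)
q4, s4 = quant_int4_grouped(w)
print("int8 max error:", np.abs(q8 * s8 - w).max())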

Hardware Acceleration

  • NPU Phoenix: Attention computation (16 TOPS)
  • AMD Radeon 780M: FFN processing (RDNA3)
  • MLIR-AIE2: Real NPU kernel compilation
  • Vulkan: Direct iGPU compute shaders (streaming/dispatch sketched below)
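
Layer-by-layer streaming can be illustrated with the safetensors API: the file is memory-mapped and each layer's tensors are materialized only while that layer runs. In the sketch below, the file name, tensor naming scheme, and layer count are assumptions, and the NPU/iGPU dispatch calls are placeholders for the engine's MLIR-AIE2 and Vulkan paths:

# streaming_sketch.py - layer-by-layer safetensors streaming (illustrative)
from safetensors import safe_open

MODEL_FILE = "model.safetensors"   # hypothetical shard name

def run_layer(idx, tensors):
    # Placeholders for the real dispatch: attention weights go to the
    # NPU kernels, FFN weights go to Vulkan compute shaders on the iGPU.
    attn = {k: v for k, v in tensors.items() if "self_attn" in k}
    ffn = {k: v for k, v in tensors.items() if "mlp" in k}
    ...  # e.g. npu_attention(attn); igpu_ffn(ffn)

with safe_open(MODEL_FILE, framework="numpy") as f:
    names = list(f.keys())
    for idx in range(62):          # assumed Gemma 3 27B layer count
        prefix = f"model.layers.{idx}."
        layer = {n: f.get_tensor(n) for n in names if n.startswith(prefix)}
        run_layer(idx, layer)      # tensors freed once the layer finishes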

πŸ¦„ About This Implementation

This model demonstrates advanced NPU+iGPU AI acceleration, showing how consumer AMD Ryzen AI hardware can run a 27B-parameter language model entirely on the NPU and iGPU, with zero CPU fallback.

Framework: Unicorn Execution Engine
Date: July 10, 2025
Company: Magic Unicorn Unconventional Technology & Stuff Inc
Platform: Unicorn Commander

πŸ“– Citation

@software{unicorn_execution_engine_gemma_27b_2025,
  title={Gemma 3 27B NPU+iGPU Quantized: Hardware-Accelerated Large Language Model},
  author={Unicorn Commander},
  year={2025},
  url={https://huggingface.co/magicunicorn/gemma-3-27b-npu-quantized},
  note={Production NPU+iGPU quantized large language model}
}

πŸ“š Related Resources

  • Unicorn Execution Engine: https://github.com/magicunicorn/unicorn-execution-engine
  • Model page: https://huggingface.co/magicunicorn/gemma-3-27b-npu-quantized

πŸ”’ License

This model is released under the Apache 2.0 License, following the original Gemma 3 license terms.


πŸ¦„ NPU+iGPU Large Language Model
⚑ Powered by Unicorn Execution Engine
🏒 Magic Unicorn Unconventional Technology & Stuff Inc
