# Gemma 3 27B NPU+iGPU Quantized

*Advanced NPU+iGPU Implementation*
This NPU+iGPU quantized Gemma 3 27B model demonstrates advanced AI hardware acceleration techniques. It runs on consumer AMD Ryzen AI hardware, with the NPU Phoenix and the AMD Radeon 780M iGPU sharing the inference workload.
## Production Status
- Status: PRODUCTION READY
- Server: Operational OpenAI v1 API server
- Hardware: Real NPU Phoenix + AMD Radeon 780M
- Size: 26GB quantized (74% reduction from 102GB)
- Format: Safetensors layer-by-layer streaming
- API: OpenAI v1 compatible
## Quick Start

### Using with Unicorn Execution Engine
```bash
# Clone the framework
git clone https://github.com/magicunicorn/unicorn-execution-engine.git
cd unicorn-execution-engine

# Download this model
huggingface-cli download magicunicorn/gemma-3-27b-npu-quantized

# Start production server
source activate-uc1-ai-py311.sh
python real_2025_gemma27b_server.py

# Server runs on http://localhost:8009
# Model: "gemma-3-27b-it-npu-igpu-real"
```
### Using with OpenWebUI
Add the server to OpenWebUI as an OpenAI-compatible connection:

- URL: `http://localhost:8009`
- Model: `gemma-3-27b-it-npu-igpu-real`
- API: OpenAI v1 compatible
## Hardware Requirements

### Minimum Requirements
- NPU: AMD Ryzen AI NPU Phoenix (16 TOPS)
- iGPU: AMD Radeon 780M (RDNA3 architecture)
- Memory: 32GB+ DDR5 RAM (96GB recommended)
- Storage: 30GB+ for model files
- OS: Ubuntu 25.04+ with Linux 6.14+ (HMA support)
### Software Requirements
- Unicorn Execution Engine: Latest version
- MLIR-AIE2: Included in framework
- Vulkan Drivers: Latest AMD drivers
- XRT Runtime: /opt/xilinx/xrt
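Before starting the server, it can help to confirm the XRT runtime and Vulkan tooling are visible. This check is an illustrative sketch, not part of the Unicorn Execution Engine; it only tests presence, not that the NPU or 780M is actually usable.

```python
import os
import shutil

# XRT location comes from the requirements list above.
assert os.path.isdir("/opt/xilinx/xrt"), "XRT runtime not found at /opt/xilinx/xrt"

# `vulkaninfo` ships with standard Vulkan tooling; its presence suggests
# the AMD Vulkan drivers are installed.
assert shutil.which("vulkaninfo"), "vulkaninfo not on PATH; install Vulkan tools/drivers"
print("XRT and Vulkan tooling found.")
```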
## Performance

### Benchmark Results
- Hardware: Real NPU + iGPU acceleration
- Attention: NPU Phoenix (16 TOPS)
- FFN: AMD Radeon 780M (200+ GFLOPS)
- Memory: Layer-by-layer streaming
- Quality: Full 27B parameter model preserved
### Technical Specifications
- Parameters: 27.4B (quantized)
- Precision: INT4/INT8 optimized for NPU+iGPU
- Context Length: 8192 tokens
- Architecture: Gemma 3 with grouped-query attention
- Quantization: Custom NPU+iGPU aware quantization
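Grouped-query attention lets several query heads share one key/value head, which shrinks the KV cache that must stay resident during attention by a factor of `n_heads / n_kv_heads`. The NumPy sketch below illustrates the mechanism only; the head counts are made up and do not reflect Gemma 3's actual configuration.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_heads, seq, d = q.shape
    # Each KV head serves n_heads // n_kv_heads query heads.
    k = np.repeat(k, n_heads // n_kv_heads, axis=0)
    v = np.repeat(v, n_heads // n_kv_heads, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # (n_heads, seq, d)

# 8 query heads sharing 2 KV heads (illustrative numbers only)
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
out = grouped_query_attention(q, k, v, n_kv_heads=2)
```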
## Technical Details

### Quantization Strategy
- NPU Layers: INT8 symmetric quantization
- iGPU Layers: INT4 grouped quantization
- Memory Optimized: Layer-by-layer streaming
- Zero CPU Fallback: Pure hardware acceleration
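The difference between the two schemes is easiest to see in code. The sketch below is illustrative only: the framework's actual quantizer is not shown here, and the INT4 group size of 128 is an assumption.

```python
import numpy as np

def quantize_int8_symmetric(w):
    """INT8 symmetric: zero-point fixed at 0, a single scale per tensor."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                      # dequantize as q * scale

def quantize_int4_grouped(w, group_size=128):
    """INT4 grouped: one scale per group of weights (group size assumed)."""
    groups = w.reshape(-1, group_size)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 7.0, 1e-12)
    q = np.clip(np.round(groups / scales), -7, 7)    # 4-bit symmetric range
    return q.astype(np.int8), scales     # stored in int8 here for readability
```

Read against the list above, the split is: finer per-tensor INT8 for the NPU-side attention layers, denser per-group INT4 for the larger FFN matrices handled by the iGPU.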
### Hardware Acceleration
- NPU Phoenix: Attention computation (16 TOPS)
- AMD Radeon 780M: FFN processing (RDNA3)
- MLIR-AIE2: Real NPU kernel compilation
- Vulkan: Direct iGPU compute shaders
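The layer-by-layer streaming mentioned throughout this card can be approximated with safetensors' lazy loading, which materializes one tensor at a time instead of mapping all 26GB at once. This is a sketch under assumptions: the `model.layers.<i>.` key prefix and the layer count are hypothetical, not read from the actual checkpoint.

```python
from safetensors import safe_open

def stream_layers(path, n_layers, prefix="model.layers"):
    """Yield one decoder layer's tensors at a time to bound host RAM usage."""
    with safe_open(path, framework="np") as f:
        names = list(f.keys())
        for i in range(n_layers):
            layer = {n: f.get_tensor(n) for n in names
                     if n.startswith(f"{prefix}.{i}.")}
            yield i, layer  # caller processes the layer, then drops the reference

NUM_LAYERS = 62  # hypothetical; read the real count from the checkpoint config
for i, layer in stream_layers("model.safetensors", NUM_LAYERS):
    pass  # quantize / dispatch this layer, then move on to the next
```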
## About This Implementation
This model shows how consumer AMD Ryzen AI hardware can run large language models with real NPU+iGPU acceleration and no CPU fallback.
- Framework: Unicorn Execution Engine
- Date: July 10, 2025
- Company: Magic Unicorn Unconventional Technology & Stuff Inc
- Platform: Unicorn Commander
## Citation

```bibtex
@software{unicorn_execution_engine_gemma_27b_2025,
  title={Gemma 3 27B NPU+iGPU Quantized: NPU+iGPU Large Language Model},
  author={Unicorn Commander},
  year={2025},
  url={https://huggingface.co/magicunicorn/gemma-3-27b-npu-quantized},
  note={Production NPU+iGPU quantized large language model}
}
```
## Related Resources
- Framework: Unicorn Execution Engine
- Company: Magic Unicorn Unconventional Technology & Stuff Inc
- Platform: Unicorn Commander
- Documentation: Complete guides in framework repository
## License
This model is released under the Apache 2.0 License, following the original Gemma 3 license terms.
---

**NPU+iGPU Large Language Model**
**Powered by Unicorn Execution Engine**
**Magic Unicorn Unconventional Technology & Stuff Inc**
## Evaluation Results (self-reported)

- Hardware acceleration: real NPU+iGPU acceleration
- Model size: 26GB quantized (from 102GB original)