πŸ¦„ Gemma 3 27B NPU+iGPU Quantized

πŸš€ Advanced NPU+iGPU Implementation

This quantized Gemma 3 27B model demonstrates advanced AI hardware acceleration: it runs on AMD Ryzen AI hardware, splitting work between the NPU Phoenix and the AMD Radeon 780M iGPU.

βœ… Production Status

  • Status: βœ… PRODUCTION READY
  • Server: Operational OpenAI v1 API server
  • Hardware: Real NPU Phoenix + AMD Radeon 780M
  • Size: 26GB quantized (74% reduction from 102GB)
  • Format: Safetensors layer-by-layer streaming
  • API: OpenAI v1 compatible

🎯 Quick Start

Using with Unicorn Execution Engine

# Clone the framework
git clone https://github.com/magicunicorn/unicorn-execution-engine.git
cd unicorn-execution-engine

# Download this model
huggingface-cli download magicunicorn/gemma-3-27b-npu-quantized

# Start production server
source activate-uc1-ai-py311.sh
python real_2025_gemma27b_server.py

# Server runs on http://localhost:8009
# Model: "gemma-3-27b-it-npu-igpu-real"
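
With the server up, any OpenAI v1 client can talk to it. Below is a minimal Python sketch against the standard /v1/chat/completions route; it assumes the server implements the usual OpenAI request schema, so adjust the payload if your build differs.

# query_server.py - minimal OpenAI v1 chat request (sketch; assumes standard schema)
import requests

resp = requests.post(
    "http://localhost:8009/v1/chat/completions",
    json={
        "model": "gemma-3-27b-it-npu-igpu-real",
        "messages": [{"role": "user", "content": "Hello from the NPU!"}],
        "max_tokens": 128,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])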

Using with OpenWebUI

# Add to OpenWebUI
URL: http://localhost:8009
Model: gemma-3-27b-it-npu-igpu-real
API: OpenAI v1 Compatible

πŸ”§ Hardware Requirements

Minimum Requirements

  • NPU: AMD Ryzen AI NPU Phoenix (16 TOPS)
  • iGPU: AMD Radeon 780M (RDNA3 architecture)
  • Memory: 32GB+ DDR5 RAM (96GB recommended)
  • Storage: 30GB+ for model files
  • OS: Ubuntu 25.04+ with Linux 6.14+ (HMA support)

Software Requirements

  • Unicorn Execution Engine: Latest version
  • MLIR-AIE2: Included in framework
  • Vulkan Drivers: Latest AMD drivers
  • XRT Runtime: /opt/xilinx/xrt

🎯 Performance

Benchmark Results

  • Hardware: Real NPU + iGPU acceleration
  • Attention: NPU Phoenix (16 TOPS)
  • FFN: AMD Radeon 780M (200+ GFLOPS)
  • Memory: Layer-by-layer streaming
  • Quality: Full 27.4B-parameter model preserved

Technical Specifications

  • Parameters: 27.4B (quantized)
  • Precision: INT4/INT8 optimized for NPU+iGPU
  • Context Length: 8192 tokens
  • Architecture: Gemma 3 with grouped-query attention (sketched after this list)
  • Quantization: Custom NPU+iGPU aware quantization
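
Grouped-query attention shares each key/value head across a group of query heads, which shrinks the KV cache and helps the attention stage fit on the NPU. A minimal NumPy sketch of the mechanism (head counts and sizes here are illustrative, not Gemma 3's actual configuration):

# gqa_sketch.py - grouped-query attention with shared K/V heads (illustrative)
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 64, 16   # hypothetical sizes
group = n_q_heads // n_kv_heads                     # query heads per K/V head

q = np.random.randn(n_q_heads, seq, d_head)
k = np.random.randn(n_kv_heads, seq, d_head)
v = np.random.randn(n_kv_heads, seq, d_head)

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group                                 # which shared K/V head
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over keys
    out[h] = weights @ v[kv]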

πŸ“š Technical Details

Quantization Strategy

  • NPU Layers: INT8 symmetric quantization
  • iGPU Layers: INT4 grouped quantization (both schemes sketched after this list)
  • Memory Optimized: Layer-by-layer streaming
  • Zero CPU Fallback: Pure hardware acceleration
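
For intuition, here is a minimal NumPy sketch of the two schemes named above: per-tensor symmetric INT8 (zero-point fixed at 0) and grouped INT4 (one scale per weight group). This illustrates the general techniques, not the engine's actual quantizer:

# quant_sketch.py - illustrative INT8-symmetric and INT4-grouped quantization
import numpy as np

def quant_int8_symmetric(w):
    # Symmetric INT8: one scale per tensor, zero-point fixed at 0.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                       # dequantize with q * scale

def quant_int4_grouped(w, group_size=128):
    # Grouped INT4: one scale per group of weights along the last axis.
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit range in int8 container
    return q, scale

w = np.random.randn(4096, 4096).astype(np.float32)
q8, s8 = quant_int8_symmetric(w)
q4, s4 = quant_int4_grouped(w)
print("int8 max error:", np.abs(q8 * s8 - w).max())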

Hardware Acceleration

  • NPU Phoenix: Attention computation (16 TOPS)
  • AMD Radeon 780M: FFN processing (RDNA3)
  • MLIR-AIE2: Real NPU kernel compilation
  • Vulkan: Direct iGPU compute shaders (streaming/dispatch sketched below)
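
Layer-by-layer streaming can be illustrated with the safetensors API: the file is memory-mapped and each layer's tensors are materialized only while that layer runs. In the sketch below, the file name, tensor naming scheme, and layer count are assumptions, and the NPU/iGPU dispatch calls are placeholders for the engine's MLIR-AIE2 and Vulkan paths:

# streaming_sketch.py - layer-by-layer safetensors streaming (illustrative)
from safetensors import safe_open

MODEL_FILE = "model.safetensors"   # hypothetical shard name

def run_layer(idx, tensors):
    # Placeholders for the real dispatch: attention weights go to the
    # NPU kernels, FFN weights go to Vulkan compute shaders on the iGPU.
    attn = {k: v for k, v in tensors.items() if "self_attn" in k}
    ffn = {k: v for k, v in tensors.items() if "mlp" in k}
    ...  # e.g. npu_attention(attn); igpu_ffn(ffn)

with safe_open(MODEL_FILE, framework="numpy") as f:
    names = list(f.keys())
    for idx in range(62):          # assumed Gemma 3 27B layer count
        prefix = f"model.layers.{idx}."
        layer = {n: f.get_tensor(n) for n in names if n.startswith(prefix)}
        run_layer(idx, layer)      # tensors freed once the layer finishes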

πŸ¦„ About This Implementation

This model demonstrates advanced NPU+iGPU AI acceleration, showing how consumer AMD Ryzen AI hardware can run a 27B-parameter language model entirely on the NPU and iGPU, with zero CPU fallback.

Framework: Unicorn Execution Engine
Date: July 10, 2025
Company: Magic Unicorn Unconventional Technology & Stuff Inc
Platform: Unicorn Commander

πŸ“– Citation

@software{unicorn_execution_engine_gemma_27b_2025,
  title={Gemma 3 27B NPU+iGPU Quantized: Hardware-Accelerated Large Language Model},
  author={Unicorn Commander},
  year={2025},
  url={https://huggingface.co/magicunicorn/gemma-3-27b-npu-quantized},
  note={Production NPU+iGPU quantized large language model}
}

πŸ“š Related Resources

  • Unicorn Execution Engine: https://github.com/magicunicorn/unicorn-execution-engine
  • Model page: https://huggingface.co/magicunicorn/gemma-3-27b-npu-quantized

πŸ”’ License

This model is released under the Apache 2.0 License, following the original Gemma 3 license terms.


πŸ¦„ NPU+iGPU Large Language Model
⚑ Powered by Unicorn Execution Engine
🏒 Magic Unicorn Unconventional Technology & Stuff Inc
