ONNX Runtime GPU 1.24.0 - CUDA 13.0 Build with Blackwell Support
Overview
Custom-built ONNX Runtime GPU 1.24.0 for Windows with full CUDA 13.0 and Blackwell architecture (sm_120) support. This build addresses the cudaErrorNoKernelImageForDevice error that occurs with RTX 5060 Ti and other Blackwell-generation GPUs when using official PyPI distributions.
Build Specifications
Environment
- OS: Windows 10/11 x64
- CUDA Toolkit: 13.0
- cuDNN: 9.13 (CUDA 13.0 compatible)
- Visual Studio: 2022 (v17.x) with Desktop development with C++
- Python: 3.13
- CMake: 3.26+
Supported GPU Architectures
- sm_89: Ada Lovelace (RTX 4060, 4070, 4080, 4090)
- sm_90: Hopper (H100)
- sm_120: Blackwell (RTX 5060 Ti, 5080, 5090)
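To confirm which architecture your GPU reports, you can query its compute capability via nvidia-smi (the compute_cap query field requires a reasonably recent driver); a minimal sketch in Python:

import subprocess

# Query the GPU's compute capability, e.g. "8.9" -> sm_89 (Ada Lovelace),
# "9.0" -> sm_90 (Hopper), "12.0" -> sm_120 (Blackwell).
cap = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
    text=True,
).strip()
print(f"Compute capability: {cap}")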
Build Configuration
CMAKE_CUDA_ARCHITECTURES=89;90;120
onnxruntime_USE_FLASH_ATTENTION=OFF
CUDA_VERSION=13.0
Note: Flash Attention is disabled because ONNX Runtime 1.24.0's Flash Attention kernels are sm_80-specific and incompatible with sm_90/sm_120 architectures.
Installation
pip install onnxruntime_gpu-1.24.0-cp313-cp313-win_amd64.whl
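Note: if an official onnxruntime or onnxruntime-gpu wheel was previously installed, uninstall it first so the custom CUDA provider is the one that gets loaded:

pip uninstall -y onnxruntime onnxruntime-gpu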
Verify Installation
import onnxruntime as ort
print(f"Version: {ort.__version__}")
print(f"Providers: {ort.get_available_providers()}")
# Expected output: ['CUDAExecutionProvider', 'CPUExecutionProvider']
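Beyond listing providers, a quick end-to-end check confirms that CUDA kernels actually load for your GPU (this is where cudaErrorNoKernelImageForDevice would surface with an incompatible build). A minimal sketch, assuming the onnx package is installed alongside onnxruntime; the graph name "smoke" is illustrative:

import numpy as np
import onnx
from onnx import TensorProto, helper
import onnxruntime as ort

# Build a tiny one-node (Relu) model in memory.
x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 4])
graph = helper.make_graph([helper.make_node("Relu", ["x"], ["y"])], "smoke", [x], [y])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])

# Run it on the CUDA execution provider; ORT appends a CPU fallback automatically.
sess = ort.InferenceSession(model.SerializeToString(), providers=["CUDAExecutionProvider"])
print(sess.get_providers())  # CUDAExecutionProvider should come first
print(sess.run(None, {"x": np.array([[-1.0, 0.0, 1.0, 2.0]], dtype=np.float32)}))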
Key Features
✅ Blackwell GPU Support: Full compatibility with RTX 5060 Ti, 5080, 5090
✅ CUDA 13.0 Optimized: Built with the latest CUDA toolkit for optimal performance
✅ Multi-Architecture: A single build supports Ada Lovelace, Hopper, and Blackwell
✅ Stable for Inference: Tested with WD14Tagger and Stable Diffusion pipelines
Known Limitations
⚠️ Flash Attention Disabled: Due to the sm_80-only kernel implementation in ONNX Runtime 1.24.0, Flash Attention is not available. This has minimal impact on most inference workloads (e.g., WD14Tagger, image generation models).
⚠️ Windows Only: This build is specifically for Windows x64. Linux users should build from source with a similar configuration.
Performance
Compared to CPU-only execution:
- Image tagging (WD14Tagger): 10-50x faster
- Inference latency: significantly reduced for GPU-accelerated operators
- Memory: Efficiently utilizes 16GB VRAM on RTX 5060 Ti
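These numbers depend heavily on the model, so it is worth measuring on your own workload. A rough timing sketch; "your_model.onnx" is a placeholder for any ONNX model you use:

import time
import numpy as np
import onnxruntime as ort

def avg_latency(provider, runs=50):
    sess = ort.InferenceSession("your_model.onnx", providers=[provider])
    inp = sess.get_inputs()[0]
    # Substitute 1 for any dynamic (non-integer) dimensions.
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    x = np.random.rand(*shape).astype(np.float32)
    sess.run(None, {inp.name: x})  # warm-up: lazy CUDA init and allocation
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {inp.name: x})
    return (time.perf_counter() - start) / runs

cpu = avg_latency("CPUExecutionProvider")
gpu = avg_latency("CUDAExecutionProvider")
print(f"CPU {cpu * 1000:.1f} ms | CUDA {gpu * 1000:.1f} ms | speedup {cpu / gpu:.1f}x")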
Use Cases
- ComfyUI: WD14Tagger nodes
- Stable Diffusion Forge: ONNX-based models
- General ONNX Model Inference: Any ONNX model requiring CUDA acceleration
Technical Background
Why This Build is Necessary
Official ONNX Runtime GPU distributions (PyPI) are typically built for older CUDA versions (11.x/12.x) and do not include sm_120 (Blackwell) architecture support. When running inference on Blackwell GPUs with official builds, users encounter:
cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
This custom build resolves the issue by:
- Compiling with CUDA 13.0
- Explicitly targeting sm_89, sm_90, sm_120
- Disabling incompatible Flash Attention kernels
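To see for yourself which SM targets a given wheel embeds, you can inspect the CUDA provider library with cuobjdump from the CUDA toolkit; the path below assumes the typical site-packages layout and may differ on your system:

cuobjdump --list-elf onnxruntime\capi\onnxruntime_providers_cuda.dll

An official wheel should list only older sm_* images, while this build should include sm_89, sm_90, and sm_120 entries.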
Flash Attention Status
ONNX Runtime's Flash Attention implementation currently only supports:
- sm_80: Ampere (A100, RTX 3090)
- Kernels are hardcoded with *_sm80.cu file naming
Future ONNX Runtime versions may add sm_90/sm_120 support, but as of 1.24.0, this remains unavailable.
Build Script
For those who want to replicate this build:
build.bat ^
--config Release ^
--build_shared_lib ^
--parallel ^
--use_cuda ^
--cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0" ^
--cudnn_home "C:\Program Files\NVIDIA\CUDNN\v9.13" ^
--cuda_version=13.0 ^
--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="89;90;120" ^
CUDNN_INCLUDE_DIR="C:\Program Files\NVIDIA\CUDNN\v9.13\include\13.0" ^
CUDNN_LIBRARY="C:\Program Files\NVIDIA\CUDNN\v9.13\lib\13.0\x64\cudnn.lib" ^
onnxruntime_USE_FLASH_ATTENTION=OFF ^
--build_wheel ^
--skip_tests
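On success, the wheel is typically written to the build tree's dist directory (the exact path depends on your build configuration):

pip install build\Windows\Release\Release\dist\onnxruntime_gpu-1.24.0-cp313-cp313-win_amd64.whl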
Credits
Built by @ussoewwin for the community facing Blackwell GPU compatibility issues with ONNX Runtime.
License
Apache 2.0 (same as ONNX Runtime)