nvidia-smi
pip install llama-cpp-python --force-reinstall --no-cache-dir --upgrade --verbose --config-settings=llama-cpp-python:extra_backend="cuda"
Thanks for sharing everything; you're very close to finishing this. Here's exactly what went wrong and how to fix it:
❌ The Problem
You're using:
✅ Option 2: Build llama.dll from source (if using custom CUDA or AVX2 options)
If you want to build it yourself, the requirements are:
- Visual Studio 2022 (with the C++ build tools)
- this CMake flag:
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
But this throws:
LLAMA_CUBLAS is deprecated and will be removed in the future. Use GGML_CUDA instead
✅ The Fix
You need to use -DGGML_CUDA=on instead of -DLLAMA_CUBLAS=on.
Here's the correct and updated Windows CMD setup:
🔧 Step-by-Step Commands
set CMAKE_ARGS=-DGGML_CUDA=on
set FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python
✅ Pre-Requirements Recap
- Visual Studio Build Tools (C++ workload)
- cmake, ninja, setuptools, and wheel installed:
pip install cmake ninja setuptools wheel
- CUDA installed (you have CUDA 12.9, which is fine for an RTX 4060)
📁 After Installing
Check this path again:
C:\Users\Ankit-kushwaha90\miniconda3\envs\cyber_llm\Lib\site-packages\llama_cpp\lib
You should now see: llama.dll (the compiled shared library for GPU support)
✅ Finally: Run Your Script
Once llama.dll is present:
python token_app.py
Let me know if you want a minimal script to test the llama_cpp library separately before running your actual app.
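For reference, here is one minimal sketch of such a test script (the filename test_llama_cpp.py is arbitrary; the only assumption is the lib folder layout shown above). If the import line itself fails, the compiled shared library could not be loaded, so there is no point running the full app yet.

# test_llama_cpp.py - minimal check before running the real app (hypothetical filename)
from pathlib import Path

import llama_cpp  # fails here if the compiled shared library cannot be loaded

print("llama-cpp-python version:", llama_cpp.__version__)

# List the shared libraries bundled with the install (llama.dll should appear on Windows)
lib_dir = Path(llama_cpp.__file__).parent / "lib"
for lib in sorted(lib_dir.iterdir()):
    print(" -", lib.name)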
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [29 lines of output]
*** scikit-build-core 0.11.5 using CMake 4.0.3 (wheel)
*** Configuring CMake...
loading initial cache file C:\Users\ANKIT-~1\AppData\Local\Temp\tmpaz6_a8mn\build\CMakeInit.txt
-- Building for: Visual Studio 17 2022
-- The C compiler identification is MSVC 19.44.35213.0
-- The CXX compiler identification is MSVC 19.44.35213.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Preview/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Preview/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.50.1.windows.1")
CMake Error at vendor/llama.cpp/CMakeLists.txt:116 (message):
LLAMA_CUBLAS is deprecated and will be removed in the future.
Use GGML_CUDA instead
Call Stack (most recent call first):
vendor/llama.cpp/CMakeLists.txt:121 (llama_option_depr)
-- Configuring incomplete, errors occurred!
*** CMake configuration failed
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)
D:\ai_model>
Before installing, set up Visual Studio with CMake, MSVC, and the C++ build tools.
Perfect! Since you're using an NVIDIA GPU with CUDA, here's exactly how to run deepseek-llm-7b-chat-Q6_K.gguf with llama-cpp-python and full GPU acceleration on Windows or Linux.
✅ 1. Install llama-cpp-python with CUDA (NVIDIA)
This compiles the llama.cpp backend with NVIDIA CUDA (cuBLAS) support.
🔹 In PowerShell:
$env:CMAKE_ARGS="-DGGML_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir
🔹 In CMD:
set CMAKE_ARGS=-DGGML_CUDA=on
pip install llama-cpp-python --force-reinstall --no-cache-dir
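Before loading a model, you can also ask the bindings whether this build supports GPU offload at all. This is a quick sketch; it assumes a recent llama-cpp-python version that exposes llama_supports_gpu_offload from llama.cpp's C API. If it prints False, the CMAKE_ARGS setting was not picked up and the wheel needs to be rebuilt.

# check_gpu_backend.py - prints True only if the wheel was compiled with a GPU backend
import llama_cpp.llama_cpp as llama_backend

print("GPU offload supported:", llama_backend.llama_supports_gpu_offload())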
✅ 2. Run the DeepSeek Model on GPU with Python
Here's the Python code to load the .gguf model with the CUDA-enabled build:
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-llm-7b-chat-Q6_K.gguf",  # adjust path if needed
    n_gpu_layers=50,   # number of layers to load on GPU
    n_ctx=4096,        # context length
    use_mlock=True,    # optional: lock memory for performance
    verbose=True,      # print layer-device map
)

# Example query
output = llm("Q: What is DeepSeek? A:", max_tokens=100)
print(output["choices"][0]["text"])
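If you prefer the chat-style API, the same Llama object also provides create_chat_completion; a short usage sketch (the prompt content is just an example):

# Chat-style call with the same llm object created above
chat = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is DeepSeek?"},
    ],
    max_tokens=100,
)
print(chat["choices"][0]["message"]["content"])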
🔍 Output Should Show:
llama_model_load: using CUDA backend
llama_kv_cache_unified: layer 0: dev = GPU
llama_kv_cache_unified: layer 1: dev = GPU
...
❌ If it says dev = CPU, you didn't compile with CUDA correctly or n_gpu_layers is still 0.
✅ 3. Make Sure You Have These Installed:
Environment Setup Checklist for CUDA with llama-cpp-python
📋 Required Tools
| Tool | Command to Verify | Note |
|---|---|---|
| NVIDIA GPU | nvidia-smi | Shows driver and memory |
| CUDA Toolkit | nvcc --version | Should be 11.8 or newer |
| Visual Studio (Win) | Already installed ✔️ | Needed for CMake + MSVC |
| CMake | cmake --version | Should be ≥ 3.26 |
| Python 3.10+ | python --version | Recommended: 3.10 or 3.11 |
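If you want to run these checks in one go, here is a minimal Python sketch that shells out to the same commands listed in the table (env_check.py is a hypothetical filename; it assumes the tools are on your PATH):

# env_check.py - runs the verification commands from the checklist above
import shutil
import subprocess

CHECKS = [
    ("NVIDIA driver", ["nvidia-smi"]),
    ("CUDA Toolkit", ["nvcc", "--version"]),
    ("CMake", ["cmake", "--version"]),
    ("Python", ["python", "--version"]),
]

for name, cmd in CHECKS:
    if shutil.which(cmd[0]) is None:
        print(f"{name}: NOT FOUND on PATH")
        continue
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout or result.stderr).strip().splitlines()
    print(f"{name}: {lines[0] if lines else 'no output'}")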
🧠 GPU Layer Tuning (n_gpu_layers)
| GPU (VRAM) | Suggested n_gpu_layers |
|---|---|
| 6 GB | 20–30 |
| 8 GB | 35–50 |
| 12 GB+ | 60–80+ |
🧪 Tip: Start with a lower n_gpu_layers and increase it until your VRAM is nearly full without causing OOM (Out of Memory) errors.
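As a shortcut, n_gpu_layers=-1 tells llama-cpp-python to offload every layer. On an 8 GB card this may or may not fit a 7B Q6_K model, so treat the sketch below as a starting point rather than a guaranteed setting:

from llama_cpp import Llama

# Try offloading all layers; lower n_gpu_layers if you hit CUDA out-of-memory errors
llm = Llama(model_path="deepseek-llm-7b-chat-Q6_K.gguf", n_gpu_layers=-1, n_ctx=4096)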
✅ (Optional) Run in the CLI (C++ llama.cpp build). If you're using the C++ CLI version of llama.cpp:
./llama-cli -m deepseek-llm-7b-chat-Q6_K.gguf -ngl 50 -p "Q: What is DeepSeek? A:"
(current llama.cpp builds name the binary llama-cli; older builds called it ./main)
🔧 Troubleshooting
If you're still stuck on CPU:
- Reinstall with CMAKE_ARGS=-DGGML_CUDA=on again (keep --force-reinstall --no-cache-dir so the wheel is actually rebuilt)
- Verify the CUDA toolkit and driver are installed (nvcc --version, nvidia-smi)
- Try a lower n_gpu_layers (start with 10, then increase)