nvidia-smi
pip install llama-cpp-python --force-reinstall --no-cache-dir --upgrade --verbose --config-settings=llama-cpp-python:extra_backend="cuda"
Thanks for sharing everything; you're very close to finishing this. Here's exactly what went wrong and how to fix it:
❌ The Problem
You're using:
✅ Option 2: Build llama.dll from source (if using custom CUDA or AVX2 options)
If you want to build it yourself, the requirements are:
- Visual Studio 2022 (with the C++ build tools)
- this CMake flag:
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
But this throws:
LLAMA_CUBLAS is deprecated and will be removed in the future. Use GGML_CUDA instead
✅ The Fix
You need to use -DGGML_CUDA=on instead of -DLLAMA_CUBLAS=on.
Here's the correct and updated Windows CMD setup:
🔧 Step-by-Step Commands
set CMAKE_ARGS=-DGGML_CUDA=on
set FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python
✅ Pre-Requirements Recap
- Visual Studio Build Tools (C++ workload)
- cmake, ninja, setuptools, and wheel installed:
pip install cmake ninja setuptools wheel
- CUDA installed (you have CUDA 12.9, which is fine for an RTX 4060)
📁 After Installing
Check this path again:
C:\Users\Ankit-kushwaha90\miniconda3\envs\cyber_llm\Lib\site-packages\llama_cpp\lib
You should now see: llama.dll (the compiled shared library for GPU support)
✅ Finally: Run Your Script
Once llama.dll is present:
python token_app.py
Let me know if you want a minimal script to test the llama_cpp library separately before running your actual app.
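For reference, here is one minimal sketch of such a test script (the filename test_llama_cpp.py is arbitrary; the only assumption is the lib folder layout shown above). If the import line itself fails, the compiled shared library could not be loaded, so there is no point running the full app yet.

# test_llama_cpp.py - minimal check before running the real app (hypothetical filename)
from pathlib import Path

import llama_cpp  # fails here if the compiled shared library cannot be loaded

print("llama-cpp-python version:", llama_cpp.__version__)

# List the shared libraries bundled with the install (llama.dll should appear on Windows)
lib_dir = Path(llama_cpp.__file__).parent / "lib"
for lib in sorted(lib_dir.iterdir()):
    print(" -", lib.name)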
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [29 lines of output]
*** scikit-build-core 0.11.5 using CMake 4.0.3 (wheel)
*** Configuring CMake...
loading initial cache file C:\Users\ANKIT-~1\AppData\Local\Temp\tmpaz6_a8mn\build\CMakeInit.txt
-- Building for: Visual Studio 17 2022
-- The C compiler identification is MSVC 19.44.35213.0
-- The CXX compiler identification is MSVC 19.44.35213.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Preview/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Preview/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.50.1.windows.1")
CMake Error at vendor/llama.cpp/CMakeLists.txt:116 (message):
LLAMA_CUBLAS is deprecated and will be removed in the future.
Use GGML_CUDA instead
Call Stack (most recent call first):
vendor/llama.cpp/CMakeLists.txt:121 (llama_option_depr)
-- Configuring incomplete, errors occurred!
*** CMake configuration failed
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)
D:\ai_model>
Before installing, set up Visual Studio with CMake, MSVC, and the C++ build tools.
Perfect! Since you're using an NVIDIA GPU with CUDA, here's exactly how to run deepseek-llm-7b-chat-Q6_K.gguf with llama-cpp-python and full GPU acceleration on Windows or Linux.
✅ 1. Install llama-cpp-python with CUDA (NVIDIA)
This compiles the llama.cpp backend with NVIDIA CUDA (cuBLAS) support.
🔹 In PowerShell:
$env:CMAKE_ARGS="-DGGML_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir
🔹 In CMD:
set CMAKE_ARGS=-DGGML_CUDA=on
pip install llama-cpp-python --force-reinstall --no-cache-dir
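Before loading a model, you can also ask the bindings whether this build supports GPU offload at all. This is a quick sketch; it assumes a recent llama-cpp-python version that exposes llama_supports_gpu_offload from llama.cpp's C API. If it prints False, the CMAKE_ARGS setting was not picked up and the wheel needs to be rebuilt.

# check_gpu_backend.py - prints True only if the wheel was compiled with a GPU backend
import llama_cpp.llama_cpp as llama_backend

print("GPU offload supported:", llama_backend.llama_supports_gpu_offload())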
✅ 2. Run the DeepSeek Model on GPU with Python
Here's the Python code to load the .gguf model with the CUDA-enabled build:
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-llm-7b-chat-Q6_K.gguf",  # adjust path if needed
    n_gpu_layers=50,   # number of layers to load on GPU
    n_ctx=4096,        # context length
    use_mlock=True,    # optional: lock memory for performance
    verbose=True,      # print layer-device map
)

# Example query
output = llm("Q: What is DeepSeek? A:", max_tokens=100)
print(output["choices"][0]["text"])
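If you prefer the chat-style API, the same Llama object also provides create_chat_completion; a short usage sketch (the prompt content is just an example):

# Chat-style call with the same llm object created above
chat = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is DeepSeek?"},
    ],
    max_tokens=100,
)
print(chat["choices"][0]["message"]["content"])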
🔍 Output Should Show:
llama_model_load: using CUDA backend
llama_kv_cache_unified: layer 0: dev = GPU
llama_kv_cache_unified: layer 1: dev = GPU
...
❌ If it says dev = CPU, you didn't compile with CUDA correctly or n_gpu_layers is still 0.
✅ 3. Make Sure You Have These Installed:
Environment Setup Checklist for CUDA with llama-cpp-python
📋 Required Tools
| Tool | Command to Verify | Note |
|---|---|---|
| NVIDIA GPU | nvidia-smi | Shows driver and memory |
| CUDA Toolkit | nvcc --version | Should be 11.8 or newer |
| Visual Studio (Win) | Already installed ✔️ | Needed for CMake + MSVC |
| CMake | cmake --version | Should be ≥ 3.26 |
| Python 3.10+ | python --version | Recommended: 3.10 or 3.11 |
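If you want to run these checks in one go, here is a minimal Python sketch that shells out to the same commands listed in the table (env_check.py is a hypothetical filename; it assumes the tools are on your PATH):

# env_check.py - runs the verification commands from the checklist above
import shutil
import subprocess

CHECKS = [
    ("NVIDIA driver", ["nvidia-smi"]),
    ("CUDA Toolkit", ["nvcc", "--version"]),
    ("CMake", ["cmake", "--version"]),
    ("Python", ["python", "--version"]),
]

for name, cmd in CHECKS:
    if shutil.which(cmd[0]) is None:
        print(f"{name}: NOT FOUND on PATH")
        continue
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout or result.stderr).strip().splitlines()
    print(f"{name}: {lines[0] if lines else 'no output'}")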
🧠 GPU Layer Tuning (n_gpu_layers)
| GPU (VRAM) | Suggested n_gpu_layers |
|---|---|
| 6 GB | 20–30 |
| 8 GB | 35–50 |
| 12 GB+ | 60–80+ |
🧪 Tip: Start with a lower n_gpu_layers and increase it until your VRAM is nearly full without causing OOM (Out of Memory) errors.
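As a shortcut, n_gpu_layers=-1 tells llama-cpp-python to offload every layer. On an 8 GB card this may or may not fit a 7B Q6_K model, so treat the sketch below as a starting point rather than a guaranteed setting:

from llama_cpp import Llama

# Try offloading all layers; lower n_gpu_layers if you hit CUDA out-of-memory errors
llm = Llama(model_path="deepseek-llm-7b-chat-Q6_K.gguf", n_gpu_layers=-1, n_ctx=4096)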
✅ (Optional) Run in the CLI (C++ llama.cpp build). If you're using the C++ CLI version of llama.cpp:
./llama-cli -m deepseek-llm-7b-chat-Q6_K.gguf -ngl 50 -p "Q: What is DeepSeek? A:"
(current llama.cpp builds name the binary llama-cli; older builds called it ./main)
🔧 Troubleshooting
If you're still stuck on CPU:
- Reinstall with CMAKE_ARGS=-DGGML_CUDA=on again (keep --force-reinstall --no-cache-dir so the wheel is actually rebuilt)
- Verify the CUDA toolkit and driver are installed (nvcc --version, nvidia-smi)
- Try a lower n_gpu_layers (start with 10, then increase)