nvidia-smi
pip install llama-cpp-python --force-reinstall --no-cache-dir --upgrade --verbose --config-settings=llama-cpp-python:extra_backend="cuda"

Thanks for sharing everything; you're very close to finishing this. Here's exactly what went wrong and how to fix it:

❌ The Problem

You're building llama.dll from source (Option 2, used for custom CUDA or AVX2 options), which requires Visual Studio 2022 with the C++ build tools, and you set:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on

But this throws:

LLAMA_CUBLAS is deprecated and will be removed in the future. Use GGML_CUDA instead

✅ The Fix

You need to use -DGGML_CUDA=on instead of -DLLAMA_CUBLAS=on.

Here's the correct and updated Windows CMD setup:

πŸ” Step-by-Step Command

set CMAKE_ARGS=-DGGML_CUDA=on
set FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python

✅ Pre-Requirements Recap

✅ Visual Studio Build Tools (C++ workload)
✅ cmake, ninja, and setuptools installed:

pip install cmake ninja setuptools wheel

✅ CUDA installed (you have CUDA 12.9, which is fine for an RTX 4060)

πŸ” After Installing

Check this path again:

C:\Users\Ankit-kushwaha90\miniconda3\envs\cyber_llm\Lib\site-packages\llama_cpp\lib

You should now see: llama.dll (the compiled shared library for GPU support)
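If you'd rather check from Python than browse to the folder, the short script below locates the installed llama_cpp package and lists its bundled lib directory (a sketch using only the standard library; the expected file name is llama.dll on Windows or libllama.so on Linux):

import importlib.util
from pathlib import Path

# Find where llama_cpp was installed and look inside its bundled lib/ folder.
spec = importlib.util.find_spec("llama_cpp")
assert spec is not None, "llama_cpp is not installed in this environment"
lib_dir = Path(spec.origin).parent / "lib"
print(f"Looking in: {lib_dir}")
for f in sorted(lib_dir.iterdir()):
    print("  ", f.name)  # expect llama.dll (Windows) or libllama.so (Linux)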

βœ… Finally: Run Your Script

Once llama.dll is present:

python token_app.py

Here's a minimal script to test the llama_cpp library separately before running your actual app:
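(This is only a sketch; it assumes deepseek-llm-7b-chat-Q6_K.gguf is in the current directory, so adjust the path to your setup.)

from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-llm-7b-chat-Q6_K.gguf",  # adjust path if needed
    n_gpu_layers=10,   # small offload, just enough to confirm CUDA is used
    n_ctx=512,         # small context keeps the smoke test fast
    verbose=True,      # prints the layer/device map while loading
)

out = llm("Hello", max_tokens=8)
print(out["choices"][0]["text"])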

Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      *** scikit-build-core 0.11.5 using CMake 4.0.3 (wheel)
      *** Configuring CMake...
      loading initial cache file C:\Users\ANKIT-~1\AppData\Local\Temp\tmpaz6_a8mn\build\CMakeInit.txt
      -- Building for: Visual Studio 17 2022
      -- The C compiler identification is MSVC 19.44.35213.0
      -- The CXX compiler identification is MSVC 19.44.35213.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Preview/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Preview/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.50.1.windows.1")
      CMake Error at vendor/llama.cpp/CMakeLists.txt:116 (message):
        LLAMA_CUBLAS is deprecated and will be removed in the future.

        Use GGML_CUDA instead

      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:121 (llama_option_depr)


      -- Configuring incomplete, errors occurred!

      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)


Note: install Visual Studio (with CMake, MSVC, and the C++ workload) before running the build.

Perfect! Since you're using NVIDIA CUDA, here's exactly how to run deepseek-llm-7b-chat-Q6_K.gguf using llama-cpp-python with full GPU acceleration via CUDA on Windows or Linux.

✅ 1. Install llama-cpp-python with CUDA (NVIDIA)

This compiles the llama.cpp backend with CUDA (cuBLAS) support.

🔹 In PowerShell:

$env:CMAKE_ARGS="-DGGML_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir

🔹 In CMD:

set CMAKE_ARGS=-DGGML_CUDA=on
pip install llama-cpp-python --force-reinstall --no-cache-dir

✅ 2. Run the DeepSeek model on the GPU with Python

Here's the correct Python code to load the .gguf model with the CUDA-enabled build:

from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-llm-7b-chat-Q6_K.gguf",  # adjust path if needed
    n_gpu_layers=50,         # Number of layers to load on GPU
    n_ctx=4096,              # Context length
    use_mlock=True,          # Optional: lock memory for performance
    verbose=True             # Print layer-device map
)

# Example query
output = llm("Q: What is DeepSeek? A:", max_tokens=100)
print(output["choices"][0]["text"])

πŸ” Output Should Show:

llama_model_load: using CUDA backend
llama_kv_cache_unified: layer 0: dev = GPU
llama_kv_cache_unified: layer 1: dev = GPU

...

❌ If it says dev = CPU → you didn't compile with CUDA correctly, or n_gpu_layers is still 0.
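Once the layers show dev = GPU, you can also drive the same chat-tuned model through llama-cpp-python's OpenAI-style chat interface. A short sketch reusing the llm object created above (the prompt content is just an example):

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is DeepSeek?"},
]
resp = llm.create_chat_completion(messages=messages, max_tokens=100)
print(resp["choices"][0]["message"]["content"])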

✅ 3. Make Sure You Have These Installed:

✅ Environment Setup Checklist for CUDA with llama-cpp-python

🚀 Required Tools

Tool                | Command to Verify | Note
NVIDIA GPU          | nvidia-smi        | Shows driver and memory
CUDA Toolkit        | nvcc --version    | Should be 11.8 or newer
Visual Studio (Win) | Already installed ✔️ | Needed for CMake + MSVC
CMake               | cmake --version   | Should be ≥ 3.26
Python 3.10+        | python --version  | Recommended: 3.10 or 3.11

🧠 GPU Layer Tuning (n_gpu_layers)

GPU (VRAM) | Suggested n_gpu_layers
6 GB       | 20–30
8 GB       | 35–50
12 GB+     | 60–80+

🧪 Tip: Start with a lower n_gpu_layers and increase it until your VRAM is nearly full without causing OOM (Out of Memory) errors.
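If you prefer to pick a starting point in code, a tiny helper like the one below mirrors the table above (the function name and thresholds are only an illustration, not part of llama-cpp-python):

def suggest_n_gpu_layers(vram_gb: float) -> int:
    """Rough starting value for n_gpu_layers, based on the VRAM table above."""
    if vram_gb >= 12:
        return 60
    if vram_gb >= 8:
        return 35
    if vram_gb >= 6:
        return 20
    return 0  # very little VRAM: keep all layers on the CPU

print(suggest_n_gpu_layers(8))  # e.g. an 8 GB card -> 35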

✅ (Optional) Run in the CLI (C++ llama.cpp build)

If you're using the C++ CLI version of llama.cpp:

./main -m deepseek-llm-7b-chat-Q6_K.gguf -ngl 50 -p "Q: What is DeepSeek? A:"

🔧 Troubleshooting

If still stuck on CPU:

  • Reinstall with GGML_CUDA=on again

  • Verify CUDA runtime is installed (nvcc, nvidia-smi)

  • Try lower n_gpu_layers (start with 10, then increase)
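One quick way to confirm the rebuild, assuming your llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding (recent releases do), is to ask the library directly whether it was built with a GPU backend:

import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# True only if the wheel was compiled with a GPU backend such as CUDA;
# this binding's availability can vary between versions.
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())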
