My 3090TI 24GB VRAM is very happy. Thank you.

#11
by ubergarm - opened




Wow, QwQ-32B is impressive for such a small model. I've been relying on the R1 671B UD-Q2_K_XL quant with partial CPU/GPU offload via ktransformers, battling NUMA node issues, just to refactor my Python apps, but now I can load the entire QwQ-32B IQ4_XS with 32k context into the 3090TI's 24GB VRAM and watch it rip at over 30 tok/sec. In my initial test it seems comparable at refactoring a ~250-line Python LLM chat app. I'll keep testing! Thanks!

./llama-server \
    --model "../models/bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-IQ4_XS.gguf" \
    --n-gpu-layers 65 \
    --ctx-size 32768 \
    --parallel 1 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 16 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080

./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 4831 (5e43f104)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

uname -a
Linux bigfan 6.13.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 02 Feb 2025 01:02:29 +0000 x86_64 GNU/Linux
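
For a quick sanity check of the setup above, you can hit llama-server's OpenAI-compatible chat endpoint once it's running. This is just a minimal sketch: the prompt is a placeholder, and since llama-server only serves the one loaded model the model field can be omitted.

# placeholder request against the OpenAI-compatible endpoint started above
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Write a one-line docstring for a function that reverses a string."}],
          "max_tokens": 256
        }'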

Hi, did you deploy this with vLLM? I'd also like to deploy it on a local GPU. Is there a tutorial you'd recommend? Thanks!


This was deployed with llama.cpp, not vLLM.
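
In case it helps, here's roughly how to get the ./llama-server binary used above: a minimal CUDA build of llama.cpp plus downloading the quant (this assumes the CUDA toolkit, cmake, and huggingface-cli are already installed; paths and -j are just examples, adjust to your setup).

# https://github.com/ggml-org/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
# binaries end up in build/bin/
./build/bin/llama-server --version

# grab the quant (repo/filename taken from the --model path in the command above)
huggingface-cli download bartowski/Qwen_QwQ-32B-GGUF Qwen_QwQ-32B-IQ4_XS.gguf \
    --local-dir ../models/bartowski/Qwen_QwQ-32B-GGUF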

@sghn @yangyangjuanjuan

# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# https://docs.vllm.ai/en/stable/getting_started/quickstart.html
mkdir vllm
cd vllm/
uv venv ./venv --python=3.12 --python-preference=only-managed
source venv/bin/activate
uv pip install vllm
vllm --version
# INFO 03-07 11:13:15 __init__.py:207] Automatically detected platform cuda.
# 0.7.3

# https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve
OMP_NUM_THREADS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve \
    Qwen/QwQ-32B-AWQ \
    --download-dir /mnt/raid/models/ \
    --load-format auto \
    --dtype auto \
    --kv-cache-dtype auto \
    --max-model-len 32768 \
    --host 127.0.0.1 \
    --port 8080

# VRAM 35.46GiB / 47.99GiB
# INFO 03-07 11:41:10 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.

Note: there is currently a bug with --tensor-parallel-size 2: https://github.com/vllm-project/vllm/issues/14449
