My 3090 Ti's 24GB of VRAM is very happy. Thank you.
Wow, QwQ-32B is impressive for such a small model. I've been relying on an R1 671B UD-Q2_K_XL quant with partial CPU/GPU offload via ktransformers (battling NUMA node issues) just to refactor my Python apps, but now I can load the entire QwQ-32B IQ4_XS with 32k context into the 3090 Ti's 24GB VRAM and watch it rip at over 30 tok/sec. In my initial test it seems comparable at refactoring a ~250-line Python LLM chat app. I'll keep testing! Thanks!
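For anyone wondering how 32k context fits in 24GB, here is a rough back-of-the-envelope. The numbers are assumptions, not measurements: an IQ4_XS file around 17-18GB, QwQ-32B keeping Qwen2.5-32B's GQA layout (64 layers, 8 KV heads, 128 head dim), and q8_0 at about 1.0625 bytes per element.
# rough estimate only; layer/head counts and file size are assumptions
echo '2*64*8*128*1.0625*32768 / 1024^3' | bc -l   # q8_0 K+V cache at 32k ctx: ~4.25 GiB
# ~17.7GiB weights + ~4.3GiB KV cache + compute buffers comes in just under 24GB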
./llama-server \
--model "../models/bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-IQ4_XS.gguf" \
--n-gpu-layers 65 \
--ctx-size 32768 \
--parallel 1 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads 16 \
--flash-attn \
--mlock \
--n-predict -1 \
--host 127.0.0.1 \
--port 8080
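If anyone wants to hit it without a frontend, llama-server exposes an OpenAI-compatible chat endpoint, so a plain curl works. The model name and prompt below are just placeholder examples; the server answers with whatever model it loaded.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "QwQ-32B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.6,
        "max_tokens": 512
      }'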
./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 4831 (5e43f104)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
uname -a
Linux bigfan 6.13.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 02 Feb 2025 01:02:29 +0000 x86_64 GNU/Linux
Hi, did you deploy this with vLLM? I'd also like to deploy it locally on my GPU. Is there a tutorial you'd recommend? Thanks!
This was deployed with llama.cpp (llama-server). If you want to try vLLM instead, here's a setup that works:
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
# https://docs.vllm.ai/en/stable/getting_started/quickstart.html
mkdir vllm
cd vllm/
uv venv ./venv --python=3.12 --python-preference=only-managed
source venv/bin/activate
uv pip install vllm
vllm --version
# INFO 03-07 11:13:15 __init__.py:207] Automatically detected platform cuda.
# 0.7.3
# https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve
OMP_NUM_THREADS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve \
Qwen/QwQ-32B-AWQ \
--download-dir /mnt/raid/models/ \
--load-format auto \
--dtype auto \
--kv-cache-dtype auto \
--max-model-len 32768 \
--host 127.0.0.1 \
--port 8080
# VRAM 35.46GiB / 47.99GiB
# INFO 03-07 11:41:10 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
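Once it's up, the same kind of OpenAI-compatible request works against vLLM; the main difference is that the "model" field should match the served model id. This is just an example request, not part of the original setup.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/QwQ-32B-AWQ",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 128
      }'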
# --tensor-parallel-size 2 currently runs into a bug: https://github.com/vllm-project/vllm/issues/14449