quant req: 256 GB RAM + 96 GB VRAM

#1 opened by whoisjeremylam

This would be a perfect size :-)

So 356 GiB / 666 GiB gives us a target close to 4 bpw, which should still give good performance. Perhaps an IQ4_KSS would turn out about right.
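For a rough sanity check on that budget (assuming ~671B total parameters, and before accounting for the headroom lost to KV cache and compute buffers):

# raw bits-per-weight budget for a 356 GiB target, assuming ~671B weights
echo "356 * 2^30 * 8 / 671000000000" | bc -l    # ≈ 4.56 bpw

Reserving room for context and keeping the attention/shexp tensors near q8_0 pushes the practical average down, which is part of why the target lands closer to 4 bpw.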

I'm currently redoing my bf16 safetensors -> bf16 GGUF conversion based on some updated info from ik_llama.cpp here: https://github.com/ikawrakow/ik_llama.cpp/issues/651#issuecomment-3212864652

I'll re-upload a new imatrix made with this new GGUF and use it for the remainder of the quants, so hopefully I can now quantize attn_(k|v)_b for faster TG without losing much quality. I already released the IQ5_K, though, given it uses Q8_0 for those tensors anyway; it is basically a max-quality quant that is faster and smaller than the full Q8_0.

If anyone uses the IQ5_K: I noticed I needed to pass llama-server --chat-template deepseek3 ..., as the template seems to be auto-detected incorrectly, which was causing loops when hitting the /chat/completions endpoint. It seems to work fine with the correct template, or if you're applying your own chat template with SillyTavern etc. via the /text/completions endpoint.
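For reference, a minimal launch line along those lines; the model path, context size, and port below are placeholders rather than my exact command:

./build/bin/llama-server \
    --model /models/DeepSeek-V3.1-IQ5_K-00001-of-00011.gguf \
    --chat-template deepseek3 \
    --ctx-size 32768 \
    --host 127.0.0.1 --port 8080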

But yeah, should be a good size!

That sounds great! I just wanted to add that the 96 GB of VRAM is across two cards (A6000 & 48 GB 4090D), which incurs a little overhead for CUDA graphs, plus some inefficiency in VRAM usage due to the size of the layers.

In the past I've been using V3/R1 quants at Q3_K_M with good success. Not sure what the difference would be to an IQ4_KSS lol.

PS: my daily driver has been your DeepSeek-TNG-R1T2-Chimera-IQ3_KS quant. It's a great model!

Speaking of DeepSeek-TNG-R1T2-Chimera-IQ3_KS, I only just stumbled across the conversation here:

https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/discussions/2

I find it quite a coincidence that I wasn't the only one who liked that model. I wonder if it was just good fortune that IQ3_KS turned out to be a sweet spot in the trade-off between size and quality.

Fast forwarding to V3.1, I wonder if there would be much difference between an IQ4_KSS and an IQ3_KS?

PS: There's a wealth of information there that I'll have to incorporate into my configuration...!

Yeah, I have the same amount of VRAM and system RAM; I'd be interested in an iq4_kss.

My biggest annoyance with these bigger models is the time it takes to load them into RAM! I guess a PCIe 5.0 NVMe is on my upgrade list lol

Excited to see the perplexity differences of these quants -

Fast forwarding to V3.1, I wonder if there would be much difference between an IQ4_KSS and an IQ3_KS?

Excited to see the perplexity differences of these quants -

Yeah, I'm slowly working through some quants now that I have this sorted out (I had to re-make my bf16 GGUF) (pinging @anikifoss as this might be relevant if you're cooking new MLA quants too)

I'm expecting perplexity data to trickle in as I cook and test, and I'll have a graph up sometime this weekend.

My current biggest question is about some quants like IQ3_KS and the quantization tweaks in PR624, so I have to cook a quant twice, measure perplexity twice, and then release whichever is "better" haha...

I'll likely have a few options in this 3-4 bpw range with data so you can choose what fits best on your rig. I'm also trying to balance perplexity with TG speed. Definitely pull and rebuild ik_llama.cpp, as it got some more updates for PP speed on CPU, especially for Zen 5 and CPUs with real 512-bit avx_vnni instructions.
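If it helps, the rebuild is the usual cmake flow; the CUDA flag here is an assumption for GPU-offload builds, drop it for CPU-only:

cd ik_llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)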

Okay, I've published the first perplexity graph and the IQ4_KSS is uploading now. The IQ3_KS is already there!

Will you make an IQ4_K? I was wondering if you have plans to make a 4.7-4.9 bpw quant, though I will try the closest match to this.

Downloading the iq4_kss now, I think it will just barely squeeze onto my system!

Also, that perplexity chart is very pretty, so linear lmao.

I'll update this comment once the download finishes with some performance numbers.

update 1: I'm going to have to settle for iq3, I just barely can't fit the iq4_kss, hitting an OOM error due to the 256 GB system RAM I have... :c

llm_load_tensors:        CPU buffer size = 252954.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   497.11 MiB
llm_load_tensors:      CUDA0 buffer size = 19796.32 MiB
llm_load_tensors:      CUDA1 buffer size = 19769.05 MiB
llm_load_tensors:      CUDA2 buffer size = 19769.05 MiB
llm_load_tensors:      CUDA3 buffer size = 20104.91 MiB

@ubergarm , how did you convert to BF16? I've tried with mainline but there are issues. And I've tried with https://github.com/evshiron/llama.cpp, which worked but your imatrix isn't compatible with the BF16 it produced.

@Thireus

how did you convert to BF16? I've tried with mainline but there are issues.

I used the "triton-cpu" method outlined in this mainline llama.cpp issue to run the original DeepSeek casting script and the mainline convert script, e.g.:

deepseek-ai fp8 safetensors -> fp8_cast_bf16.py -> bf16 safetensors -> mainline convert_hf_to_gguf.py -> bf16 GGUF -> imatrix

your imatrix isn't compatible with the BF16 it produced.

Given you used the evshiron method, you'll end up with the attn_kv_b MLA tensor. I do have an imatrix for that too, as that is what I originally did. You can still download it from an earlier commit/upload in this repo here: https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/blob/db827cc0e0f9c46db67f106840ac594294d6cd11/imatrix-DeepSeek-V3.1-Q8_0.dat

The reason I switched to the "mainline MLA style" for the first time is this comment/discussion: https://github.com/ikawrakow/ik_llama.cpp/issues/651#issuecomment-3212883303

The only real difference is that the evshiron/ik convert method will end up with attn_kv_b, and the ik_llama.cpp imatrix won't give data for the attn_(k|v)_b tensors, so you want to leave those at q8_0. The final file size will be a couple GB bigger too, as you will have to leave attn_kv_b at q8_0 as well, so you kind of have duplicated data.

If you use the cast + mainline convert, or the new direct fp8 safetensors -> bf16 GGUF PR from mainline, possibly with some changes (bartowski did this successfully) https://github.com/ggml-org/llama.cpp/pull/14810, you will end up without attn_kv_b and you will have imatrix data for attn_(k|v)_b, but you'll probably want to keep them close to q8_0 anyway given they are so small and fairly sensitive.
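For anyone rolling their own mix, here's a rough sketch of what pinning those tensors looks like with ik_llama.cpp's llama-quantize; the file names, regexes, and thread count are illustrative only, and this assumes your build has the --custom-q option:

./build/bin/llama-quantize \
    --imatrix imatrix-DeepSeek-V3.1.dat \
    --custom-q "attn_k_b=q8_0,attn_v_b=q8_0" \
    DeepSeek-V3.1-BF16.gguf DeepSeek-V3.1-IQ4_KSS.gguf \
    IQ4_KSS 24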

hmu if you get stuck on any parts, and I'm looking forward to what your systematic approach to quantizing shows for this model!

@ubergarm thanks for sharing quantization tips, as usual!

@phakio

I'm going to have to settle for iq3, I just barely can't fit the iq4_kss, hitting an OOM error due to the 256 GB system RAM I have... :c

Say no more, fam: ## IQ3_K 293.177 GiB (3.753 BPW) 😹

This one is a bit different, keeping full q8_0 for attn/shexp/the first 3 dense layers, which will likely slow down TG a little but should give about the best possible perplexity for the size.

After I run perplexity clean with no NaNs and do a quick vibe check, I'll start uploading it!
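For reference, a check like that can be run with llama-perplexity over the usual wiki.test.raw corpus; the model path and thread count below are placeholders, and a mangled tensor shows up as nan in the running estimate:

./build/bin/llama-perplexity \
    -m DeepSeek-V3.1-IQ3_K-00001-of-00007.gguf \
    -f wiki.test.raw \
    -c 512 \
    --threads 24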

Finally got a working quant running locally. Has anyone figured out how to turn on thinking with V3.1 and llama.cpp?

@anikifoss

Has anyone figured out how to turn on thinking with V3.1 and llama.cpp?

Right, some folks asked over here and I've been wondering the best way to handle that too...

My impression is that, given where the model expects the <think> or </think> token to appear, it would require the client to use the /text/completion API endpoint of llama-server. The baked-in deepseek3 template can't handle it, pretty sure.

Here are the options:

  1. Use llama-server with your own client like SillyTavern, use the /text/completion endpoint, and inject the token yourself as desired in your own client-side-controlled chat template.
  2. Add code to ik_llama.cpp/llama.cpp with two new "chat templates", e.g. deepseek3-think and deepseek3-nothink, that turn thinking on or off, to support the /chat/completion endpoint.

Not super elegant, but the existing heuristics to detect the chat template from the jinja are not working great anymore given how big the chat templates have become, causing false positives and already requiring users to pass --chat-template deepseek3...

Any other ideas?? Right now I have a simple Python client hitting the /chat/completion endpoint and not specifying anything, which seems to default to no thinking... ??
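To make option 1 concrete, here's a rough sketch against the raw completion endpoint. The endpoint path, prompt formatting, and the <think> opener are from memory of the V3.1 template, so double-check them against the model's chat template before relying on this:

# build the prompt client-side and append the think-open tag so the model reasons first
curl -s http://127.0.0.1:8080/completion -d '{
  "prompt": "<｜begin▁of▁sentence｜><｜User｜>Why is the sky blue?<｜Assistant｜><think>",
  "n_predict": 1024
}'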

Any other ideas?? Right now I have a simple Python client hitting the /chat/completion endpoint and not specifying anything, which seems to default to no thinking... ??

Yeah, this is what I'm seeing as well.

@ubergarm , I finally found my stupid mistake. I had forgotten to copy the tokenizer.json file...

cp ~/AI/huggingface/DeepSeek-V3.1/tokenizer.json ~/AI/DeepSeek-V3.1-bf16-safetensors/

Full commands:

# Install dependencies
apt install python3-dev python3-pip python3-venv python3-wheel python3-setuptools git acl netcat-openbsd cmake git-lfs # python-is-python3 sudo netcat

# Prepare env
mkdir -p ~/AI/fp8-to-bf16
cd ~/AI/fp8-to-bf16
uv venv ./venv --python 3.12 --python-preference=only-managed

# Activate env
source venv/bin/activate

# Clone llama.cpp for DeepSeek-V3.1
git clone https://github.com/ggml-org/llama.cpp --recursive
cd llama.cpp

# Build llama.cpp
cd ~/AI/fp8-to-bf16/llama.cpp
uv pip install -r requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match
cmake -B build -DGGML_AVX=ON -DGGML_AVX2=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j16
cd ..

# Build triton-cpu
git clone https://github.com/triton-lang/triton-cpu --recursive
cd triton-cpu
uv pip install ninja cmake wheel setuptools pybind11

# Apply this patch - https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2865306085
nano -w CMakeLists.txt
---
#  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Werror -Wno-covered-switch-default -fvisibility=hidden")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-covered-switch-default -fvisibility=hidden")
---
nano -w third_party/cpu/CMakeLists.txt
---
#find_package(dnnl CONFIG)
#if (dnnl_FOUND)
#...
#endif()
---

# Install dependencies
uv pip install -r python/requirements.txt
apt-get update
apt-get install -y ccache
apt-get install -y --no-install-recommends \
    zlib1g-dev \
    libxml2-dev \
    libssl-dev \
    libgmp-dev \
    libmpfr-dev

# Compile
MAX_JOBS=16 uv pip install -e python --no-build-isolation

# Be patient, "Preparing Packages" downloads a lot of stuff before build begins...
cd ..

# Download model
mkdir -p ~/AI/huggingface
cd ~/AI/huggingface
git lfs clone https://huggingface.co/deepseek-ai/DeepSeek-V3.1

# Additional requirements specific to DeepSeek-V3.1
cd ~/AI/fp8-to-bf16/llama.cpp
mkdir ~/AI/DeepSeek-V3.1-bf16-safetensors
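# Note: fp8_cast_bf16.py is DeepSeek's casting script (inference/fp8_cast_bf16.py in the
# deepseek-ai/DeepSeek-V3 repo); copy it and its kernel.py dependency here before running this step.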
python fp8_cast_bf16.py \
      --input-fp8-hf-path ~/AI/huggingface/DeepSeek-V3.1/ \
      --output-bf16-hf-path ~/AI/DeepSeek-V3.1-bf16-safetensors/ 2>&1 | tee -a fp8_cast_bf16-DeepSeek-V3.1.log
cp ~/AI/huggingface/DeepSeek-V3.1/config.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/generation_config.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/tokenizer_config.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/tokenizer.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/*.py ~/AI/DeepSeek-V3.1-bf16-safetensors/

# DeepSeek-V3.1
cd ~/AI/fp8-to-bf16/llama.cpp
ulimit -n 99999
mkdir -p ~/AI/DeepSeek-V3.1/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_SPLIT/
python convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile ~/AI/DeepSeek-V3.1/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_SPLIT/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_TENSOR \
    --no-tensor-first-split --split-max-tensors 1 \
    ~/AI/DeepSeek-V3.1-bf16-safetensors


You rock! I replaced the IQ3_KSS with the IQ3_K and honestly the speed difference is only like 0.4 t/s, which is fine as I don't really use DeepSeek for coding, more so analytical work.
I noticed that DeepSeek V3.1 seems less creative, more sterile by default... I had to re-prompt like 3 times to get it to provide an answer "from your point of view", like a "what would you do differently in this scenario" type question... I saw other reports of this in the main model's discussions too; it seems like they are going the GPT-5 approach of making models a workhorse rather than an imaginary friend... less creativity = more accurate coding, but at the cost of less flair... better prompting is needed by us! We need to really describe what we want done.
