Q4KS

#4
by Alastar-Smith - opened

Hello!
Can we get Q4_K_S quants, since Q4_K_M is too slow for me? It is a 1 GB difference.
Or should I just download Bartowski's variant from https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF ?
Is there a difference between Bartowski's and Unsloth's quants?

I too am wondering what exactly is different about the dynamic quants, and whether it is relevant to the GGUFs for llama.cpp or just to the bnb-4bit for vLLM.

Let's look closer at some available information:

  1. The recent DeepSeek-R1 Unsloth Dynamic quants, labeled UD (e.g. UD-Q2_K_XL), do use a custom unslothai fork of llama.cpp's llama-quant.cpp, though the modifications seem specific to the DeepSeek-V3 MoE architecture. (That is a great model, btw, if you have 96GB+ RAM and a single 16GB+ VRAM CUDA GPU for ktransformers.)
  2. The blog post methodology used to create the above UD quant relies on the same Bartowski importance matrix for the smaller quants.
  3. unsloth/QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf = 19.9GB
  4. bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-Q4_K_M.gguf = 19.9GB
  • llm_load_print_meta: EOS token = 151645 '<|im_end|>'
  • llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
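As a quick sanity check on those identical file sizes, the average bits-per-weight works out the same either way. A rough calculation (the parameter count is approximate; check the GGUF metadata for the exact figure):

```python
# Rough bits-per-weight for a 19.9 GB Q4_K_M file of QwQ-32B.
# params is approximate (~32.8B weights); adjust to the exact count if known.
file_bytes = 19.9 * 1e9
params = 32.8e9

bpw = file_bytes * 8 / params
print(f"{bpw:.2f} bits per weight")  # ~4.85, the typical Q4_K_M average
```

Since both files give the same bits-per-weight, the underlying quantization recipe is almost certainly identical.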

So the unsloth blog post on QwQ-32B mentions an additional bug fix in the tokenizer, which likely mostly affects fine-tuning:

The EOS token is correct, but the PAD token should probably be "<|vision_pad|>" instead of:

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

So from what I can tell, given that the file size is exactly the same, you are probably fine using the bartowski quant if you need a specific Q4_K_S size. I've had good luck with bartowski's IQ4_XS, which barely fits 32k context on my 3090 Ti's 24GB VRAM at over 30 tok/sec.

If you are fine-tuning, then pay closer attention to that pad_token bug fix, which does not seem to exist in the bartowski quant that I have.
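The pad-token fix itself is just a one-field change in tokenizer_config.json. A minimal sketch of the check (the config dict here is illustrative, mirroring the two fields quoted above):

```python
import json

# Fields as they appear in the unfixed tokenizer_config.json (quoted above).
cfg = {"eos_token": "<|im_end|>", "pad_token": "<|endoftext|>"}

# The fix: pad with a token that never appears in generated text, so that
# masking pad tokens during fine-tuning cannot collide with real content.
if cfg["pad_token"] in ("<|endoftext|>", cfg["eos_token"]):
    cfg["pad_token"] = "<|vision_pad|>"

print(json.dumps(cfg, indent=2))
```

For inference-only use this field should not matter, which is consistent with the identical file sizes above.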

If you want the "dynamic quant", you likely have to use vLLM with the 22.5GB unsloth/QwQ-32B-unsloth-bnb-4bit, something like this:

# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# https://docs.vllm.ai/en/stable/getting_started/quickstart.html
mkdir vllm
cd vllm/
uv venv ./venv --python=3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install vllm
vllm --version
# INFO 03-07 11:13:15 __init__.py:207] Automatically detected platform cuda.
# 0.7.3

# https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve
OMP_NUM_THREADS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve \
    unsloth/QwQ-32B-unsloth-bnb-4bit \
    --download-dir /mnt/raid/models/ \
    --load-format auto \
    --dtype auto \
    --kv-cache-dtype auto \
    --max-model-len 1024 \
    --host 127.0.0.1 \
    --port 8080
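Once the server is up, it speaks the OpenAI-compatible API on the host/port given above. A minimal stdlib-only request sketch (message content and max_tokens are just placeholders):

```python
import json
import urllib.request

# Chat request against the vllm serve endpoint started above.
# Note --max-model-len 1024 above caps prompt + completion length.
payload = {
    "model": "unsloth/QwQ-32B-unsloth-bnb-4bit",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Run this part only while the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```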

or check this guide for 2x GPUs https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit

fwiw I just tested the bartowski Qwen_QwQ-32B-IQ4_XS.gguf on the 1-shot Flappy Bird prompt using the ik_llama.cpp fork and it looks good to me. Let us know how it works out for you! (It used ~14k context, so you will want 16k minimum and preferably more.)

Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
