draft model not compatible with Qwen3 Coder
Hi, Jukofyork.
I tried to run Qwen3-0.6B-32k-Q4_0.gguf as a draft for the ubergarm/Qwen3-Coder-480B-A35B-Instruct-IQ4_K model with ik_llama.cpp and received the following error message:
llama_speculative_are_compatible: vocab_type tgt: 1
llama_speculative_are_compatible: vocab_type dft: 1
llama_speculative_are_compatible: draft vocab special tokens must match target vocab to use speculation
llama_speculative_are_compatible: tgt: bos = 11 (0), eos = 151645 (0)
llama_speculative_are_compatible: dft: bos = 151643 (0), eos = 151645 (0)
ERR [ load_model] the draft model is not compatible with the target model | tid="138495454056448" timestamp=1754833364
What did I do wrong?
Best regards, ChicoPinto.
Your llama.cpp needs updating. You should see something like this:
srv load_model: the draft model 'draft_models/Qwen3-0.6B-64k-Q4_0.gguf' is not compatible with the target model 'models/Qwen3-Coder-480B-A35B-Instruct-Q6_K_X.gguf'. tokens will be translated between the draft and target models.
This was only merged fairly recently in this PR: llama-server : implement universal assisted decoding #12635 .
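The gist of that PR: when the two vocabs don't match exactly, the server detokenizes the draft model's proposed tokens to text and re-tokenizes that text with the target model's vocab before verification. A rough Python sketch of the idea, using HF tokenizers as stand-ins for the actual C++ implementation (model ids are just examples):

from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
target_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")

def translate(draft_ids):
    # Round-trip through text so the target model can verify tokens
    # that were proposed under a different vocabulary.
    text = draft_tok.decode(draft_ids, skip_special_tokens=False)
    return target_tok.encode(text, add_special_tokens=False)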
Sorry, I just saw you said ik_llama.cpp! I don't think ik_llama.cpp has added this feature yet :/
You can still create one using transplant-vocab, e.g.:
python ./transplant_vocab.py Qwen3-0.6B Qwen3-Coder-480B-A35B-Instruct Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B
but I don't use ik_llama.cpp so I can't really test it (sometimes you have to manually edit the config file(s) produced by transplant-vocab to get a perfect match, etc).
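Judging from the error log above, the thing to line up is the special-token ids the target reports, i.e. the draft's config.json (and generation_config.json, if produced) would need something like the following for this particular target quant (illustrative, taken from the tgt line of the log):

"bos_token_id": 11,
"eos_token_id": 151645,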
You should then be able to follow the instructions in this repo's readme.md and convert and quantize Q4_0 quants with longer context.
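For reference, the convert-and-quantize step with mainline llama.cpp's tools would look something like this (paths and file names are examples; the longer-context tweak itself is covered in the readme):

python ./convert_hf_to_gguf.py Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B --outfile Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B-F16.gguf --outtype f16
./llama-quantize Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B-F16.gguf Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B-Q4_0.gguf Q4_0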
Thanks for your reply!!!
I'm using ik_llama.cpp PR #645 for speculative decoding. It works fine with your DeepSeek-R1-DRAFT-0.6B-32k-Q4_0 and Kimi-K2-Instruct-DRAFT-0.6B-32k-Q4_0 drafts, but it fails with this one.
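A typical launch line for this setup (flag names as in mainline llama-server; the exact flags added by ik_llama.cpp's PR #645 may differ) would be something like:

./llama-server -m Qwen3-Coder-480B-A35B-Instruct-IQ4_K.gguf -md Qwen3-0.6B-32k-Q4_0.gguf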
Anyway, I'll try your transplant suggestion.
Thanks, again!
I created some for Qwen3-Coder here now:
https://huggingface.co/jukofyork/Qwen3-Coder-Instruct-DRAFT-0.75B-GGUF
They look to work OK for me, with no "tokens will be translated between the draft and target models" message, so they should hopefully work in ik_llama.cpp too.
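If anyone wants to double-check a pair before loading, you can compare the tokenizer metadata of the two GGUFs, e.g. with the gguf-dump tool that ships with the gguf Python package (file name is an example; flags from memory, so check --help):

gguf-dump --no-tensors Qwen3-Coder-Instruct-DRAFT-0.75B-Q4_0.gguf | grep token_id

The bos/eos ids printed for draft and target should be identical, per the llama_speculative_are_compatible check above.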
Great!!! I only just saw this reply. I already tested it and it works great! I even thanked you for the new draft in its chat. Thank you, again!!!!
BTW, I also tested your GLM 4.5 draft and it also works fine with ik_llama.cpp. :-)