Draft model not compatible with Qwen3 Coder


Hi, Jukofyork.

I tried to run Qwen3-0.6B-32k-Q4_0.gguf as a draft for the ubergarm/Qwen3-Coder-480B-A35B-Instruct-IQ4_K model with ik_llama.cpp and got the following error message:

llama_speculative_are_compatible: vocab_type tgt: 1
llama_speculative_are_compatible: vocab_type dft: 1
llama_speculative_are_compatible: draft vocab special tokens must match target vocab to use speculation
llama_speculative_are_compatible: tgt: bos = 11 (0), eos = 151645 (0)
llama_speculative_are_compatible: dft: bos = 151643 (0), eos = 151645 (0)
ERR [ load_model] the draft model is not compatible with the target model | tid="138495454056448" timestamp=1754833364

What did I do wrong?

Best regards, ChicoPinto.

Your llama.cpp needs updating: the log shows the draft's BOS token id (151643) doesn't match the target's (11), and older builds refuse to pair models whose special tokens differ, while newer builds translate tokens between the two vocabs instead. You should see something like this:

srv    load_model: the draft model 'draft_models/Qwen3-0.6B-64k-Q4_0.gguf' is not compatible with the target model 'models/Qwen3-Coder-480B-A35B-Instruct-Q6_K_X.gguf'. tokens will be translated between the draft and target models.

This was only merged fairly recently, in this PR: "llama-server : implement universal assisted decoding" (#12635).
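The basic idea behind "universal assisted decoding" is just to round-trip the draft's tokens through text, so the two vocabs never need to match exactly. A rough sketch of the concept (this is not the actual llama.cpp code, and the Hugging Face model names are just illustrative):

from transformers import AutoTokenizer

# Round-trip sketch: detokenize with the draft's tokenizer, then
# retokenize with the target's, so mismatched vocabs still work.
draft_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
target_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")

def translate_draft_tokens(draft_ids: list[int]) -> list[int]:
    text = draft_tok.decode(draft_ids, skip_special_tokens=False)
    return target_tok.encode(text, add_special_tokens=False)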

Sorry, I just saw you said ik_llama.cpp! I don't think ik_llama.cpp has added this feature yet :/

You can still create a compatible draft yourself using transplant-vocab, e.g.:

python ./transplant_vocab.py Qwen3-0.6B Qwen3-Coder-480B-A35B-Instruct Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B

but I don't use ik_llama.cpp, so I can't really test it (you sometimes have to manually edit the config file(s) produced by transplant-vocab to get a perfect match, etc.).
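If the special-token error comes back after transplanting, a quick way to see what the compatibility check is comparing is to diff the ids in the generated config.json against the target's (field names as in a standard HF config.json; paths match the transplant_vocab.py invocation above):

import json

# Compare the special-token ids that llama_speculative_are_compatible
# checks between the target model and the transplanted draft.
target = json.load(open("Qwen3-Coder-480B-A35B-Instruct/config.json"))
draft = json.load(open("Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B/config.json"))

for key in ("bos_token_id", "eos_token_id"):
    print(key, "target:", target.get(key), "draft:", draft.get(key))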

You should then be able to follow the instructions in this repo's readme.md to convert and quantize Q4_0 quants with longer context.
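Roughly, those two steps look like this (script and binary names are from mainline llama.cpp; see the readme for the exact context-extension edits):

# Convert the transplanted draft to GGUF, then quantize to Q4_0.
python convert_hf_to_gguf.py Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B \
    --outfile Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B-F16.gguf --outtype f16
./llama-quantize Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B-F16.gguf \
    Qwen3-Coder-480B-A35B-Instruct-DRAFT-0.75B-Q4_0.gguf Q4_0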

Thanks for your reply!!!

I'm using ik_llama.cpp PR #645 for speculative decoding. It works fine with your DeepSeek-R1-DRAFT-0.6B-32k-Q4_0 and Kimi-K2-Instruct-DRAFT-0.6B-32k-Q4_0 drafts, but it fails with this one.
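For reference, I'm launching it roughly like this (flag names as in mainline llama-server; the ik_llama.cpp PR may spell them slightly differently):

./llama-server \
    -m models/Qwen3-Coder-480B-A35B-Instruct-IQ4_K.gguf \
    -md draft_models/Qwen3-0.6B-32k-Q4_0.gguf \
    --draft-max 16 --draft-min 1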

Anyway, I'll try your transplant suggestion.

Thanks, again!

I've now created some for Qwen3-Coder here:

https://huggingface.co/jukofyork/Qwen3-Coder-Instruct-DRAFT-0.75B-GGUF

They look to work OK for me, with no "tokens will be translated between the draft and target models" message, so they should hopefully work in ik_llama.cpp too.

Great!!! I only just saw this reply. I already tested it and it works great! I even thanked you for the new draft in its chat. Thank you again!!!!

BTW, I also tested your GLM 4.5 draft and it also works fine with ik_llama.cpp. :-)
