A 0.75B-parameter draft model (for speculative decoding) for use with Qwen3-Coder-480B-A35B-Instruct.


I've included the Q4_0 quants for four different YaRN-extended context lengths.

NOTE: Because llama.cpp uses "static" YaRN, the scaling factor remains constant regardless of input length:

  • Only use the YaRN-extended versions when processing long contexts is required.
  • Use the smallest YaRN extension possible.
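
To use it as a draft model, pass it alongside the main model. A minimal llama-server invocation might look like this (the main-model filename and the draft-token counts are illustrative, and the draft flags vary between llama.cpp versions, so check llama-server --help):

./llama.cpp/build/bin/llama-server \
    -m Qwen3-Coder-480B-A35B-Instruct-Q4_K_M.gguf \
    -md Qwen3-Coder-Instruct-DRAFT-0.75B-64k-Q4_0.gguf \
    --draft-max 16 --draft-min 1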

How these were created

1. The initial model was created from Qwen/Qwen3-0.6B using transplant-vocab:

> python3 transplant_vocab.py Qwen3-0.6B Qwen3-Coder-480B-A35B-Instruct Qwen3-Coder-Instruct-DRAFT-0.75B

Loading config from 'Qwen3-0.6B'... Done.
Loading config from 'Qwen3-Coder-480B-A35B-Instruct'... Done.
Loading tokenizer from 'Qwen3-0.6B'... Done.
Loading tokenizer from 'Qwen3-Coder-480B-A35B-Instruct'... Done.
Loading model from 'Qwen3-0.6B'... Done.

Input model configuration:
- Target vocabulary size    : 151936 (used = 151669, unused = 267)
- Donor vocabulary size     : 151936
- Donor num layers          : 28 (tied embeddings = True)
- Donor hidden size         : 1024
- Donor attention heads     : 16
- Donor intermediate size   : 3072 (ratio = 1:3.0)
- Donor total parameters    : 596049920 (0.60B)
-- Embedding parameters     : 155582464 (0.16B)
-- Non-embedding parameters : 440467456 (0.44B)

Processing 3 automatic token overrides:
✘ 'bos_token_id' : Not found for target model
✔ 'eos_token_id' : 151645 '<|im_end|>' → [151645] '<|im_end|>'
✔ 'pad_token_id' : 151643 '<|endoftext|>' → [151643] '<|endoftext|>'

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|████████████████████████████████████████| 151669/151669 [00:40<00:00, 3751.23token/s]

Transplant mappings:
- 1 to 1  : 149829 (99%)
- 2 to 1  : 816 (0.54%)
- 3 to 1  : 506 (0.33%)
- 4 to 1  : 331 (0.22%)
- 5 to 1  : 118 (0.078%)
- 6 to 1  : 38 (0.025%)
- 7 to 1  : 22 (0.015%)
- 8 to 1  : 8 (0.0053%)
- 9 to 1  : 1 (0.00066%)

Head initialized with:
- Copies : 149829 (99%)
- Means  : 1840 (1.2%)
- Zeros  : 267 (0.18%)
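
The "Copies / Means / Zeros" breakdown above follows directly from the mappings: a 1-to-1 target token gets the donor's embedding row copied verbatim, an n-to-1 token gets the mean of the n donor rows, and the 267 unused slots are zeroed. A minimal sketch of that initialization logic, assuming a precomputed mapping from target token ids to lists of donor token ids (an illustration of the idea, not the actual transplant_vocab.py code):

import torch

def init_head(donor_embed: torch.Tensor,
              mapping: dict[int, list[int]],
              target_vocab: int) -> torch.Tensor:
    hidden = donor_embed.shape[1]
    # Unmapped (unused) token slots stay zero-initialized ("Zeros").
    out = torch.zeros(target_vocab, hidden, dtype=donor_embed.dtype)
    for tgt_id, donor_ids in mapping.items():
        if len(donor_ids) == 1:
            out[tgt_id] = donor_embed[donor_ids[0]]           # copy (1-to-1)
        else:
            out[tgt_id] = donor_embed[donor_ids].mean(dim=0)  # mean (n-to-1)
    return out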

Output model configuration:
- Output vocabulary size    : 151936
- Output num layers         : 28 (tied embeddings = False)
- Output hidden size        : 1024
- Output attention heads    : 16
- Output intermediate size  : 3072 (ratio = 1:3.0)
- Output total parameters   : 751632384 (0.75B)
-- Embedding parameters     : 311164928 (0.31B)
-- Non-embedding parameters : 440467456 (0.44B)

Saving model and tokenizer to 'Qwen3-Coder-Instruct-DRAFT-0.75B' folder

Patching 'torch_dtype' in 'Qwen3-Coder-Instruct-DRAFT-0.75B/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype

Operation completed successfully (ignore any 'segmentation fault' that follows!!!)

NOTE: No subsequent fine-tuning has been performed, since 99% of the target tokens map 1-to-1 onto donor tokens.

2. The context was extended using YaRN, by patching these fields in the model's config.json:

  "max_position_embeddings": 65536,
  ...
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
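
The YaRN factor is simply the ratio of the extended to the original context length (here 65536 / 32768 = 2.0; a 131072-token variant would use 4.0). A small sketch of applying such a patch before conversion (the helper and hard-coded path are illustrative):

import json

def patch_yarn(config_path: str, target_ctx: int, original_ctx: int = 32768) -> None:
    with open(config_path) as f:
        config = json.load(f)
    config["max_position_embeddings"] = target_ctx
    config["rope_scaling"] = {
        "factor": target_ctx / original_ctx,  # e.g. 65536 / 32768 = 2.0
        "original_max_position_embeddings": original_ctx,
        "type": "yarn",
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

patch_yarn("Qwen3-Coder-Instruct-DRAFT-0.75B/config.json", 65536)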

3. Converted and quantized:

./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-Coder-Instruct-DRAFT-0.75B-64k-BF16.gguf Qwen3-Coder-Instruct-DRAFT-0.75B
./llama.cpp/build/bin/llama-quantize Qwen3-Coder-Instruct-DRAFT-0.75B-64k-BF16.gguf Qwen3-Coder-Instruct-DRAFT-0.75B-64k-Q4_0.gguf Q4_0 44

See here for information on how to patch the GGUF files for other context lengths.
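
Alternatively, llama.cpp can override the RoPE scaling at load time rather than reading it from the GGUF metadata, using flags like the following (the values mirror the 64k config above; the filename is illustrative, and flag availability depends on your build):

./llama.cpp/build/bin/llama-cli -m Qwen3-Coder-Instruct-DRAFT-0.75B-64k-Q4_0.gguf \
    --rope-scaling yarn --rope-scale 2.0 --yarn-orig-ctx 32768 -c 65536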
