A 0.6B parameter draft (speculative decoding) model for use with DeepSeek-V3-0324 and DeepSeek-V3.

See DeepSeek-V3-DRAFT-0.6B-v3.0-GGUF for the models in GGUF format for use with llama.cpp.
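To use it for speculative decoding with llama.cpp, pass it as the draft model alongside the full model. A minimal sketch (the GGUF filenames are illustrative, and the draft-related flag names may differ slightly between llama.cpp versions; check llama-server --help):

# Filenames below are placeholders -- substitute the DeepSeek-V3 and
# DeepSeek-V3-DRAFT GGUF quants you actually downloaded.
llama-server \
    --model DeepSeek-V3-0324-Q4_K_M.gguf \
    --model-draft DeepSeek-V3-DRAFT-0.6B-v3.0-Q4_0.gguf \
    --ctx-size 32768 \
    --draft-max 16 \
    --draft-min 4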
The current config.json is set for a context length of up to 32k tokens. Add the "rope_scaling" section to config.json to enable YaRN, e.g.:
"max_position_embeddings": 65536,
...
"rope_scaling": {
"factor": 2.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
"max_position_embeddings": 131072,
...
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
"max_position_embeddings": 163840,
...
"rope_scaling": {
"factor": 5.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
NOTE: Because llama.cpp uses "static-YaRN" the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when processing long contexts is required.
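If you prefer to script the change, the same edit can be applied with jq; a minimal sketch, assuming the 128k-context values above (back up config.json first):

# Patch config.json for 128k context using YaRN (factor 4.0 over the native 32k)
jq '.max_position_embeddings = 131072
    | .rope_scaling = {factor: 4.0, original_max_position_embeddings: 32768, type: "yarn"}' \
    config.json > config.json.tmp && mv config.json.tmp config.json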
The untrained draft model was created by transplanting the DeepSeek-V3 vocabulary onto Qwen2.5-0.5B-Instruct with transplant_vocab.py:

> python ./transplant_vocab.py \
./Qwen2.5-0.5B-Instruct \
./DeepSeek-V3-0324 \
./DeepSeek-V3-DRAFT-0.6B-UNTRAINED \
--override "<|begin▁of▁sentence|>" "<|endoftext|>" \
--override "<|end▁of▁sentence|>" "<|im_end|>" \
--override "<|▁pad▁|>" "<|endoftext|>" \
--override "<|fim▁hole|>" "<|fim_middle|>" \
--override "<|fim▁begin|>" "<|fim_prefix|>" \
--override "<|fim▁end|>" "<|fim_suffix|>" \
--override "<|User|>" "<|im_start|>user\\n" \
--override "<|Assistant|>" "<|im_start|>assistant\\n" \
--override "<|EOT|>" "<|endoftext|>" \
--override "<|tool▁calls▁begin|>" "<tool_call>" \
--override "<|tool▁calls▁end|>" "</tool_call>" \
--override "<|tool▁call▁begin|>" "<tool_call>" \
--override "<|tool▁call▁end|>" "</tool_call>" \
--override "<|tool▁outputs▁begin|>" "<tool_response>" \
--override "<|tool▁outputs▁end|>" "</tool_response>" \
--override "<|tool▁output▁begin|>" "<tool_response>" \
--override "<|tool▁output▁end|>" "</tool_response>" \
--override "<|tool▁sep|>" "</tool_call>"
Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'DeepSeek-V3-0324'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'DeepSeek-V3-0324'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'... Done.
Input model configuration:
- Target vocabulary size : 129280 (used = 128815, unused = 465)
- Donor vocabulary size : 151936
- Donor num layers : 24 (tied embeddings = True)
- Donor hidden size : 896
- Donor attention heads : 14
- Donor intermediate size : 4864 (ratio = 1:5.4)
- Donor total parameters : 494032768 (0.49B)
-- Embedding parameters : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
Processing 3 automatic token overrides:
✔ 'bos_token_id' : 0 '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 1 '<|end▁of▁sentence|>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : 1 is already mapped to [151645]
Processing 18 manual token overrides:
✔ 0 : '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>'
✔ 1 : '<|end▁of▁sentence|>' → [151645] '<|im_end|>'
✔ 2 : '<|▁pad▁|>' → [151643] '<|endoftext|>'
✔ 128800 : '<|fim▁hole|>' → [151660] '<|fim_middle|>'
✔ 128801 : '<|fim▁begin|>' → [151659] '<|fim_prefix|>'
✔ 128802 : '<|fim▁end|>' → [151661] '<|fim_suffix|>'
✔ 128803 : '<|User|>' → [151644, 872, 198] '<|im_start|>user\n'
✔ 128804 : '<|Assistant|>' → [151644, 77091, 198] '<|im_start|>assistant\n'
✔ 128805 : '<|EOT|>' → [151643] '<|endoftext|>'
✔ 128806 : '<|tool▁calls▁begin|>' → [151657] '<tool_call>'
✔ 128807 : '<|tool▁calls▁end|>' → [151658] '</tool_call>'
✔ 128808 : '<|tool▁call▁begin|>' → [151657] '<tool_call>'
✔ 128809 : '<|tool▁call▁end|>' → [151658] '</tool_call>'
✔ 128810 : '<|tool▁outputs▁begin|>' → [27, 14172, 9655, 29] '<tool_response>'
✔ 128811 : '<|tool▁outputs▁end|>' → [522, 14172, 9655, 29] '</tool_response>'
✔ 128812 : '<|tool▁output▁begin|>' → [27, 14172, 9655, 29] '<tool_response>'
✔ 128813 : '<|tool▁output▁end|>' → [522, 14172, 9655, 29] '</tool_response>'
✔ 128814 : '<|tool▁sep|>' → [151658] '</tool_call>'
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...
Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 128815/128815 [00:53<00:00, 2423.79token/s]
Transplant mappings:
- 1 to 1 : 83683 (65%)
- 2 to 1 : 38380 (30%)
- 3 to 1 : 4583 (3.6%)
- 4 to 1 : 927 (0.72%)
- 5 to 1 : 273 (0.21%)
- 6 to 1 : 91 (0.071%)
- 7 to 1 : 35 (0.027%)
- 8 to 1 : 22 (0.017%)
- 9 to 1 : 8 (0.0062%)
- 10 to 1 : 4 (0.0031%)
- 11 to 1 : 4 (0.0031%)
- 13 to 1 : 1 (0.00078%)
- 14 to 1 : 10 (0.0078%)
- 15 to 1 : 91 (0.071%)
- 16 to 1 : 701 (0.54%)
- 19 to 1 : 1 (0.00078%)
- 21 to 1 : 1 (0.00078%)
Head initialized with:
- Copies : 83683 (65%)
- Means : 45132 (35%)
- Zeros : 465 (0.36%)
Output model configuration:
- Output vocabulary size : 129280
- Output num layers : 24 (tied embeddings = False)
- Output hidden size : 896
- Output attention heads : 14
- Output intermediate size : 4864 (ratio = 1:5.4)
- Output total parameters : 589567872 (0.59B)
-- Embedding parameters : 231669760 (0.23B)
-- Non-embedding parameters : 357898112 (0.36B)
Saving model and tokenizer to 'DeepSeek-V3-DRAFT-0.6B-UNTRAINED' folder
[2025-08-07 15:36:33,693] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Patching 'torch_dtype' in 'DeepSeek-V3-DRAFT-0.6B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype
Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
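A quick way to verify the transplanted tokenizer is to encode a prompt containing the remapped special tokens; this is just a sanity-check sketch, assuming transformers is installed and the output folder name used above:

python - <<'EOF'
from transformers import AutoTokenizer

# The output folder contains the DeepSeek-V3 tokenizer, so the remapped special
# tokens should each encode to a single id (128803 for '<|User|>' and 128804
# for '<|Assistant|>', per the override table above).
tok = AutoTokenizer.from_pretrained("DeepSeek-V3-DRAFT-0.6B-UNTRAINED")
print(tok.encode("<|User|>Hello<|Assistant|>", add_special_tokens=False))
EOF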
The model was then trained on common-crawl-sample, the-stack-smol-xl and rombodawg-Everything-Instruct (output field only), formatted just between <|end▁of▁sentence|> tags, using the following configuration:
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/DeepSeek-V3-DRAFT-0.6B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 5e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
drop_tails = true
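The comment on gradient_accumulation_steps works out as follows (the factor of 6 is the number of GPUs, described just below); a throwaway arithmetic check:

python - <<'EOF'
# Effective batch size and tokens per optimizer step, from the config above.
gpus = 6                  # six RTX A6000s (see below)
grad_accum_steps = 10     # gradient_accumulation_steps
sequence_len = 32768      # sequence_len

print(gpus * grad_accum_steps)                  # 60      -> effective batch size
print(gpus * grad_accum_steps * sequence_len)   # 1966080 -> ~2M tokens per step
EOF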
I used six RTX A6000 GPUs over three nodes, hence the effective batch size of 60 (6 GPUs × 10 gradient accumulation steps = 60):