A 0.6B-parameter draft (speculative decoding) model for use with Kimi-K2-Instruct.

See Kimi-K2-Instruct-DRAFT-0.6B-v3.0-GGUF for the models in GGUF format for use with llama.cpp.
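To use the GGUF draft with llama.cpp, pass it alongside a quantised Kimi-K2-Instruct model. A minimal sketch (the filenames below are placeholders and the speculative-decoding flag names can vary between llama.cpp builds, so check llama-server --help):

# Sketch only: both GGUF filenames are placeholders.
~/llama.cpp/build/bin/llama-server \
    -m ./Kimi-K2-Instruct-Q4_K_M.gguf \
    -md ./Kimi-K2-Instruct-DRAFT-0.6B-v3.0-Q4_0.gguf \
    --ctx-size 32768 \
    --draft-max 16 --draft-min 1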
Extending the context above 32k

The current config.json is set for a context length of up to 32k tokens. Add a "rope_scaling" section to config.json to enable YaRN, e.g.:
To extend the context to 64k:
"max_position_embeddings": 65536,
...
"rope_scaling": {
"factor": 2.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
To extend the context to 128k:
"max_position_embeddings": 131072,
...
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
NOTE: Because llama.cpp uses "static" YaRN, the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when processing long contexts is required.
How this model was created
1. The initial model was created from Qwen2.5-0.5B-Instruct using transplant-vocab:
python ./transplant_vocab.py \
./Qwen2.5-0.5B-Instruct \
./Kimi-K2-Instruct \
./Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED \
--trust-remote-code \
--override "[BOS]" "<|endoftext|>" \
--override "[EOS]" "<|im_end|>" \
--override "<|im_end|>" "<|im_end|>" \
--override "<|im_user|>" "<|im_start|>user" \
--override "<|im_assistant|>" "<|im_start|>assistant" \
--override "<|start_header_id|>" "<|im_start|>" \
--override "<|end_header_id|>" "<|im_end|>" \
--override "[EOT]" "<|endoftext|>" \
--override "<|im_system|>" "<|im_start|>system" \
--override "<|tool_calls_section_begin|>" "<tool_call>" \
--override "<|tool_calls_section_end|>" "</tool_call>" \
--override "<|tool_call_begin|>" "<tool_call>" \
--override "<|tool_call_argument_begin|>" "<tool_call>" \
--override "<|tool_call_end|>" "</tool_call>" \
--override "<|im_middle|>" "\\n" \
--override "[UNK]" "<|endoftext|>" \
--override "[PAD]" "<|endoftext|>"
Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Kimi-K2-Instruct'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Kimi-K2-Instruct'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'... Done.
Input model configuration:
- Target vocabulary size : 163840 (used = 163840, unused = 0)
- Donor vocabulary size : 151936
- Donor num layers : 24 (tied embeddings = True)
- Donor hidden size : 896
- Donor attention heads : 14
- Donor intermediate size : 4864 (ratio = 1:5.4)
- Donor total parameters : 494032768 (0.49B)
-- Embedding parameters : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
Processing 3 automatic token overrides:
✔ 'bos_token_id' : 163584 '[BOS]' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 163585 '[EOS]' → [151645] '<|im_end|>'
✔ 'pad_token_id' : 163839 '[PAD]' → [151643] '<|endoftext|>'
Processing 17 manual token overrides:
✔ 163584 : '[BOS]' → [151643] '<|endoftext|>'
✔ 163585 : '[EOS]' → [151645] '<|im_end|>'
✔ 163586 : '<|im_end|>' → [151645] '<|im_end|>'
✔ 163587 : '<|im_user|>' → [151644, 872] '<|im_start|>user'
✔ 163588 : '<|im_assistant|>' → [151644, 77091] '<|im_start|>assistant'
✔ 163590 : '<|start_header_id|>' → [151644] '<|im_start|>'
✔ 163591 : '<|end_header_id|>' → [151645] '<|im_end|>'
✔ 163593 : '[EOT]' → [151643] '<|endoftext|>'
✔ 163594 : '<|im_system|>' → [151644, 8948] '<|im_start|>system'
✔ 163595 : '<|tool_calls_section_begin|>' → [151657] '<tool_call>'
✔ 163596 : '<|tool_calls_section_end|>' → [151658] '</tool_call>'
✔ 163597 : '<|tool_call_begin|>' → [151657] '<tool_call>'
✔ 163598 : '<|tool_call_argument_begin|>' → [151657] '<tool_call>'
✔ 163599 : '<|tool_call_end|>' → [151658] '</tool_call>'
✔ 163601 : '<|im_middle|>' → [198] '\n'
✔ 163838 : '[UNK]' → [151643] '<|endoftext|>'
✔ 163839 : '[PAD]' → [151643] '<|endoftext|>'
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...
Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 163840/163840 [01:08<00:00, 2406.47token/s]
Transplant mappings:
- 1 to 1 : 95449 (58%)
- 2 to 1 : 61938 (38%)
- 3 to 1 : 4995 (3%)
- 4 to 1 : 980 (0.6%)
- 5 to 1 : 147 (0.09%)
- 6 to 1 : 52 (0.032%)
- 7 to 1 : 15 (0.0092%)
- 8 to 1 : 17 (0.01%)
- 9 to 1 : 2 (0.0012%)
- 10 to 1 : 5 (0.0031%)
- 11 to 1 : 1 (0.00061%)
- 13 to 1 : 239 (0.15%)
Head initialized with:
- Copies : 95449 (58%)
- Means : 68391 (42%)
- Zeros : 0 (0%)
Output model configuration:
- Output vocabulary size : 163840
- Output num layers : 24 (tied embeddings = False)
- Output hidden size : 896
- Output attention heads : 14
- Output intermediate size : 4864 (ratio = 1:5.4)
- Output total parameters : 651499392 (0.65B)
-- Embedding parameters : 293601280 (0.29B)
-- Non-embedding parameters : 357898112 (0.36B)
Saving model and tokenizer to 'Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED' folder
[2025-08-07 15:47:15,620] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Patching 'torch_dtype' in 'Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype
Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
NOTE: Due to the non-standard tokenizer, this needs the --trust-remote-code option.

NOTE: I had to manually delete "pad_token_id": 163839 from config.json to get it to match the tokenizer when used in llama.cpp as a draft model.
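The same edit can be scripted, e.g. with jq (a sketch; adjust the folder name to wherever the trained model was saved):

# Drop the pad_token_id entry from config.json (requires jq; editing by hand works just as well)
jq 'del(.pad_token_id)' Kimi-K2-Instruct-DRAFT-0.6B/config.json > config.json.tmp && \
    mv config.json.tmp Kimi-K2-Instruct-DRAFT-0.6B/config.json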
2. The following datasets were used to create a fine-tuning dataset of ~2.3B tokens:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct (NOTE: output field only, formatted just between [EOS] tags; see the sketch after this list)
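As a rough illustration of the Everything_Instruct preprocessing (a sketch only: the source filename, the text field name expected by the dataset loader, and the exact [EOS] placement are assumptions, not the actual script):

# Keep only the 'output' field and wrap each sample between [EOS] tags (illustrative only)
jq -c '{text: ("[EOS]" + .output + "[EOS]")}' Everything_Instruct.jsonl \
    > datasets/rombodawg-Everything-Instruct/everything-instruct.json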
3. The model was then trained using qlora-pipe-lite for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 5e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
drop_tails = true
NOTE: Due to the non-standard tokenizer, this needs the --trust-remote-code option passed on the deepspeed call to train.py.

I used six RTX A6000 GPUs over three nodes, hence the batch size of 60 (6 GPUs × 10 gradient accumulation steps = 60).
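The launch was roughly as follows (a sketch: the exact train.py argument names are assumptions; the DeepSpeed hostfile lists the three nodes with two GPU slots each):

# Multi-node launch via the standard DeepSpeed launcher (train.py argument names are assumed)
deepspeed --hostfile=hostfile train.py --config config.toml --trust-remote-code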
4. To fix the TikToken / SentencePiece tokenizer mismatch in llama.cpp, I had to temporarily hack this change into convert_hf_to_gguf.py:
@ModelBase.register("Qwen2Model", "Qwen2ForCausalLM", "Qwen2AudioForConditionalGeneration")
class Qwen2Model(TextModel):
    model_arch = gguf.MODEL_ARCH.QWEN2

    #def set_vocab(self):
    #    try:
    #        self._set_vocab_sentencepiece()
    #    except FileNotFoundError:
    #        self._set_vocab_gpt2()

    def set_vocab(self):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
        tokpre = self.get_vocab_base_pre(tokenizer)

        # Build merges list using the approach similar to HunYuanMoE
        merges = []
        vocab = {}
        mergeable_ranks = tokenizer.model._mergeable_ranks
        for token, rank in mergeable_ranks.items():
            vocab[QwenModel.token_bytes_to_string(token)] = rank
            if len(token) == 1:
                continue
            merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
            if len(merged) == 2:
                merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))

        # Build token list
        vocab_size = self.hparams["vocab_size"]
        special_tokens = tokenizer.special_tokens
        reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **special_tokens}.items()}
        tokens: list[str] = []
        toktypes: list[int] = []

        for i in range(vocab_size):
            if i not in reverse_vocab:
                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.UNUSED)
            else:
                token = reverse_vocab[i]
                tokens.append(token)
                if i in special_tokens.values():
                    toktypes.append(gguf.TokenType.CONTROL)
                else:
                    toktypes.append(gguf.TokenType.NORMAL)

        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre(tokpre)
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)
        self.gguf_writer.add_token_merges(merges)

        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
        special_vocab.add_to_gguf(self.gguf_writer)
This then let me run:
~/llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Kimi-K2-Instruct-DRAFT-0.6B-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B
and then it quantized OK:
~/llama.cpp/build/bin/llama-quantize Kimi-K2-Instruct-DRAFT-0.6B-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B-Q4_0.gguf Q4_0 44