Draft Models
Tiny "draft" models for speculative decoding.
These are YaRN-extended versions of Qwen3-0.6B for use with llama.cpp in #12635. I've included the Q4_0 quants for 4 different context lengths.
NOTE: Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length, so use the smallest extension that covers the context you actually need rather than defaulting to the largest.
For the 64k context version, edit the `config.json` file:

```json
"max_position_embeddings": 65536,
...
"rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
```
Then convert and quantize:

```sh
./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-0.6B-64k-BF16.gguf Qwen3-0.6B
./llama.cpp/build/bin/llama-quantize Qwen3-0.6B-64k-BF16.gguf Qwen3-0.6B-64k-Q4_0.gguf Q4_0 44
```
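If you want to confirm the YaRN settings survived conversion, the same gguf_dump.py script used at the end of this post works here too (an optional sanity check):

```sh
# Show just the rope-related metadata of the freshly quantized file
./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors Qwen3-0.6B-64k-Q4_0.gguf | grep -i rope
```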
For the 128k context version:

```json
"max_position_embeddings": 131072,
...
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
```
```sh
./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-0.6B-128k-BF16.gguf Qwen3-0.6B
./llama.cpp/build/bin/llama-quantize Qwen3-0.6B-128k-BF16.gguf Qwen3-0.6B-128k-Q4_0.gguf Q4_0 44
```
For the 256k context version:

```json
"max_position_embeddings": 262144,
...
"rope_scaling": {
    "factor": 8.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
},
```
```sh
./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-0.6B-256k-BF16.gguf Qwen3-0.6B
./llama.cpp/build/bin/llama-quantize Qwen3-0.6B-256k-BF16.gguf Qwen3-0.6B-256k-Q4_0.gguf Q4_0 44
```
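The three builds differ only in the context length patched into `config.json`. If you keep one patched copy of the model per length, the batch can be scripted; the per-length directory names here are my own convention, not something from the upstream repo:

```sh
# Sketch: convert + quantize all three YaRN-extended variants.
# Assumes pre-patched copies of the HF model in Qwen3-0.6B-64k/,
# Qwen3-0.6B-128k/ and Qwen3-0.6B-256k/ (one config.json edit each).
for ctx in 64k 128k 256k; do
    ./llama.cpp/convert_hf_to_gguf.py --outtype auto \
        --outfile "Qwen3-0.6B-${ctx}-BF16.gguf" "Qwen3-0.6B-${ctx}"
    ./llama.cpp/build/bin/llama-quantize \
        "Qwen3-0.6B-${ctx}-BF16.gguf" "Qwen3-0.6B-${ctx}-Q4_0.gguf" Q4_0 44
done
```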
You can also create other context lengths by patching the GGUF metadata directly. First work out the new scale factor, eg: for 1M context (2^20 = 1048576) we need 1048576/32768 = 32.0.
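The same arithmetic for all the lengths used in this post, as a quick shell check:

```sh
# factor = target context length / original_max_position_embeddings (32768)
# (integer result; all targets here are exact multiples of 32768)
for ctx in 65536 131072 262144 1048576; do
    echo "$ctx -> factor $(( ctx / 32768 ))"
done
```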
Then copy one of the existing YaRN-extended GGUF files (ie: NOT Qwen3-0.6B-32k-Q4_0.gguf, which has no YaRN metadata to patch!) and patch it using gguf_set_metadata.py.
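For example, starting from the 256k quant (any of the YaRN-extended files would do):

```sh
cp Qwen3-0.6B-256k-Q4_0.gguf Qwen3-0.6B-1M-Q4_0.gguf
```

Then update the context length and scaling factor in place: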
```sh
./llama.cpp/gguf-py/gguf/scripts/gguf_set_metadata.py Qwen3-0.6B-1M-Q4_0.gguf qwen3.context_length 1048576
./llama.cpp/gguf-py/gguf/scripts/gguf_set_metadata.py Qwen3-0.6B-1M-Q4_0.gguf qwen3.rope.scaling.factor 32.0
```
Finally, check the result using gguf_dump.py (qwen3.context_length and qwen3.rope.scaling.factor should show the new values):

```sh
./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors Qwen3-0.6B-1M-Q4_0.gguf
```

```
INFO:gguf-dump:* Loading: Qwen3-0.6B-1M-Q4_0.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 40 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 311
3: UINT64 | 1 | GGUF.kv_count = 37
4: STRING | 1 | general.architecture = 'qwen3'
5: STRING | 1 | general.type = 'model'
6: STRING | 1 | general.name = 'Qwen3 0.6B'
7: STRING | 1 | general.basename = 'Qwen3'
8: STRING | 1 | general.size_label = '0.6B'
9: STRING | 1 | general.license = 'apache-2.0'
10: STRING | 1 | general.license.link = 'https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE'
11: UINT32 | 1 | general.base_model.count = 1
12: STRING | 1 | general.base_model.0.name = 'Qwen3 0.6B Base'
13: STRING | 1 | general.base_model.0.organization = 'Qwen'
14: STRING | 1 | general.base_model.0.repo_url = 'https://huggingface.co/Qwen/Qwen3-0.6B-Base'
15: [STRING] | 1 | general.tags = ['text-generation']
16: UINT32 | 1 | qwen3.block_count = 28
17: UINT32 | 1 | qwen3.context_length = 1048576
18: UINT32 | 1 | qwen3.embedding_length = 1024
19: UINT32 | 1 | qwen3.feed_forward_length = 3072
20: UINT32 | 1 | qwen3.attention.head_count = 16
21: UINT32 | 1 | qwen3.attention.head_count_kv = 8
22: FLOAT32 | 1 | qwen3.rope.freq_base = 1000000.0
23: FLOAT32 | 1 | qwen3.attention.layer_norm_rms_epsilon = 9.999999974752427e-07
24: UINT32 | 1 | qwen3.attention.key_length = 128
25: UINT32 | 1 | qwen3.attention.value_length = 128
26: STRING | 1 | qwen3.rope.scaling.type = 'yarn'
27: FLOAT32 | 1 | qwen3.rope.scaling.factor = 32.0
28: UINT32 | 1 | qwen3.rope.scaling.original_context_length = 32768
29: STRING | 1 | tokenizer.ggml.model = 'gpt2'
30: STRING | 1 | tokenizer.ggml.pre = 'qwen2'
31: [STRING] | 151936 | tokenizer.ggml.tokens = ['!', '"', '#', '$', '%', '&', ...]
32: [INT32] | 151936 | tokenizer.ggml.token_type = [1, 1, 1, 1, 1, 1, ...]
33: [STRING] | 151387 | tokenizer.ggml.merges = ['Ġ Ġ', 'ĠĠ ĠĠ', 'i n', 'Ġ t', 'ĠĠĠĠ ĠĠĠĠ', 'e r', ...]
34: UINT32 | 1 | tokenizer.ggml.eos_token_id = 151645
35: UINT32 | 1 | tokenizer.ggml.padding_token_id = 151643
36: UINT32 | 1 | tokenizer.ggml.bos_token_id = 151643
37: BOOL | 1 | tokenizer.ggml.add_bos_token = False
38: STRING | 1 | tokenizer.chat_template = "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%-..."
39: UINT32 | 1 | general.quantization_version = 2
40: UINT32 | 1 | general.file_type = 2
```
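As a final smoke test, the patched file can be loaded on its own with llama-cli (a sketch; the prompt is arbitrary and the context size is kept small just to confirm the file loads cleanly):

```sh
# Quick load test of the patched draft model by itself
./llama.cpp/build/bin/llama-cli -m Qwen3-0.6B-1M-Q4_0.gguf -p "Hello" -n 32 -c 4096
```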