These are YaRN-extended versions of Qwen3-0.6B for use with llama.cpp.

I've included the Q4_0 quants for 4 different context lengths:

  • Qwen3-0.6B-32k-Q4_0.gguf (native context, no YaRN scaling)
  • Qwen3-0.6B-64k-Q4_0.gguf
  • Qwen3-0.6B-128k-Q4_0.gguf
  • Qwen3-0.6B-256k-Q4_0.gguf

NOTE: Because llama.cpp uses "static" YaRN, the scaling factor remains constant regardless of input length, which can degrade quality on shorter inputs. So:

  • Only use the YaRN-extended versions when you actually need to process long contexts.
  • Use the smallest YaRN extension that covers your required context length.
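
To make this concrete, here is a minimal Python sketch (an illustration only, not llama.cpp internals): static YaRN applies the baked-in factor even to short prompts, whereas a hypothetical dynamic scheme would only start scaling once the input exceeds the native 32768-token context:

ORIGINAL_CTX = 32768  # Qwen3-0.6B's native context length

def static_yarn_factor(baked_factor: float, seq_len: int) -> float:
    # Static YaRN: the factor baked into the GGUF metadata is always applied.
    return baked_factor

def hypothetical_dynamic_factor(seq_len: int) -> float:
    # Hypothetical dynamic scaling (NOT what llama.cpp does):
    # only grow the factor once the input exceeds the native context.
    return max(1.0, seq_len / ORIGINAL_CTX)

for seq_len in (1024, 32768, 131072):
    print(seq_len, static_yarn_factor(4.0, seq_len), hypothetical_dynamic_factor(seq_len))
# With the 128k model (factor 4.0), even a 1024-token prompt is scaled by 4.0,
# which is why the smallest sufficient extension is preferred.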

How these were created

To extend the context to 64k:

  1. Edit the config.json file:
  "max_position_embeddings": 65536,
  ...
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
  2. Convert and quantize:
./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-0.6B-64k-BF16.gguf Qwen3-0.6B
./llama.cpp/build/bin/llama-quantize Qwen3-0.6B-64k-BF16.gguf Qwen3-0.6B-64k-Q4_0.gguf Q4_0 44

To extend the context to 128k:

  1. Edit the config.json file:
  "max_position_embeddings": 131072,
  ...
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
  2. Convert and quantize:
./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-0.6B-128k-BF16.gguf Qwen3-0.6B
./llama.cpp/build/bin/llama-quantize Qwen3-0.6B-128k-BF16.gguf Qwen3-0.6B-128k-Q4_0.gguf Q4_0 44

To extend the context to 256k:

  1. Edit the config.json file:
  "max_position_embeddings": 262144,
  ...
  "rope_scaling": {
    "factor": 8.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
  2. Convert and quantize:
./llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Qwen3-0.6B-256k-BF16.gguf Qwen3-0.6B
./llama.cpp/build/bin/llama-quantize Qwen3-0.6B-256k-BF16.gguf Qwen3-0.6B-256k-Q4_0.gguf Q4_0 44
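
The three recipes above differ only in two numbers, and the YaRN factor is always the target context divided by the native 32768. Here is a small helper script (hypothetical, not part of llama.cpp or Qwen's tooling) that applies the config.json edit for any target length:

import json

ORIGINAL_CTX = 32768

def patch_config(path: str, target_ctx: int) -> None:
    # Rewrite config.json with the YaRN settings for target_ctx.
    with open(path) as f:
        config = json.load(f)
    config["max_position_embeddings"] = target_ctx
    config["rope_scaling"] = {
        "factor": target_ctx / ORIGINAL_CTX,  # 65536 -> 2.0, 131072 -> 4.0, 262144 -> 8.0
        "original_max_position_embeddings": ORIGINAL_CTX,
        "type": "yarn",
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

patch_config("Qwen3-0.6B/config.json", 262144)  # 256k example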

How to patch the GGUF files for other context lengths

  1. First work out the new scale factor, e.g. for 1M context (2^20 = 1048576) we need 1048576/32768 = 32.0.

  2. Copy one of the existing YaRN-extended GGUF files (i.e. NOT Qwen3-0.6B-32k-Q4_0.gguf, which has no YaRN metadata to patch!) and patch it using gguf_set_metadata.py:

./llama.cpp/gguf-py/gguf/scripts/gguf_set_metadata.py Qwen3-0.6B-1M-Q4_0.gguf qwen3.context_length 1048576
./llama.cpp/gguf-py/gguf/scripts/gguf_set_metadata.py Qwen3-0.6B-1M-Q4_0.gguf qwen3.rope.scaling.factor 32.0
  3. Check the patch has worked using gguf_dump.py (a Python spot-check is also sketched after the dump below):
./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors Qwen3-0.6B-1M-Q4_0.gguf
INFO:gguf-dump:* Loading: Qwen3-0.6B-1M-Q4_0.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 40 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 311
      3: UINT64     |        1 | GGUF.kv_count = 37
      4: STRING     |        1 | general.architecture = 'qwen3'
      5: STRING     |        1 | general.type = 'model'
      6: STRING     |        1 | general.name = 'Qwen3 0.6B'
      7: STRING     |        1 | general.basename = 'Qwen3'
      8: STRING     |        1 | general.size_label = '0.6B'
      9: STRING     |        1 | general.license = 'apache-2.0'
     10: STRING     |        1 | general.license.link = 'https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE'
     11: UINT32     |        1 | general.base_model.count = 1
     12: STRING     |        1 | general.base_model.0.name = 'Qwen3 0.6B Base'
     13: STRING     |        1 | general.base_model.0.organization = 'Qwen'
     14: STRING     |        1 | general.base_model.0.repo_url = 'https://huggingface.co/Qwen/Qwen3-0.6B-Base'
     15: [STRING]   |        1 | general.tags = ['text-generation']
     16: UINT32     |        1 | qwen3.block_count = 28
     17: UINT32     |        1 | qwen3.context_length = 1048576
     18: UINT32     |        1 | qwen3.embedding_length = 1024
     19: UINT32     |        1 | qwen3.feed_forward_length = 3072
     20: UINT32     |        1 | qwen3.attention.head_count = 16
     21: UINT32     |        1 | qwen3.attention.head_count_kv = 8
     22: FLOAT32    |        1 | qwen3.rope.freq_base = 1000000.0
     23: FLOAT32    |        1 | qwen3.attention.layer_norm_rms_epsilon = 9.999999974752427e-07
     24: UINT32     |        1 | qwen3.attention.key_length = 128
     25: UINT32     |        1 | qwen3.attention.value_length = 128
     26: STRING     |        1 | qwen3.rope.scaling.type = 'yarn'
     27: FLOAT32    |        1 | qwen3.rope.scaling.factor = 32.0
     28: UINT32     |        1 | qwen3.rope.scaling.original_context_length = 32768
     29: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     30: STRING     |        1 | tokenizer.ggml.pre = 'qwen2'
     31: [STRING]   |   151936 | tokenizer.ggml.tokens = ['!', '"', '#', '$', '%', '&', ...]
     32: [INT32]    |   151936 | tokenizer.ggml.token_type = [1, 1, 1, 1, 1, 1, ...]
     33: [STRING]   |   151387 | tokenizer.ggml.merges = ['Ġ Ġ', 'ĠĠ ĠĠ', 'i n', 'Ġ t', 'ĠĠĠĠ ĠĠĠĠ', 'e r', ...]
     34: UINT32     |        1 | tokenizer.ggml.eos_token_id = 151645
     35: UINT32     |        1 | tokenizer.ggml.padding_token_id = 151643
     36: UINT32     |        1 | tokenizer.ggml.bos_token_id = 151643
     37: BOOL       |        1 | tokenizer.ggml.add_bos_token = False
     38: STRING     |        1 | tokenizer.chat_template = "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%-..."
     39: UINT32     |        1 | general.quantization_version = 2
     40: UINT32     |        1 | general.file_type = 2
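
The same two keys can also be spot-checked from Python with the gguf package from llama.cpp's gguf-py directory (a minimal sketch; it assumes the scalar field layout used by the bundled scripts):

from gguf import GGUFReader

reader = GGUFReader("Qwen3-0.6B-1M-Q4_0.gguf")
for key in ("qwen3.context_length", "qwen3.rope.scaling.factor"):
    field = reader.get_field(key)
    # Scalar fields store their value in the single part indexed by data[0]
    print(key, "=", field.parts[field.data[0]][0])
# Expected output:
# qwen3.context_length = 1048576
# qwen3.rope.scaling.factor = 32.0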