lbourdois committed
Commit 97acbe0 · verified · 1 Parent(s): 38e6b19

Improve language tag


Hi! As the model is multilingual, this is a PR to add languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add these 13 languages.

Files changed (1)
  1. README.md +144 -130
README.md CHANGED
@@ -1,131 +1,145 @@
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- open-thoughts/OpenThoughts-Unverified-173k
- cognitivecomputations/dolphin-r1
tags:
- draft
- speculative-decoding
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
---

![russian dolls.webp](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/hAb6qi-c0wt4wA5pl4Qup.webp)

A `0.5B` parameter draft (speculative decoding) model for use with [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).

**NOTE**: This is a draft model for the **full-sized** `DeepSeek-R1` model and not the smaller "distilled" models!

See [jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF) for the models in GGUF format.
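
For illustration, a GGUF of this draft model is typically passed to `llama.cpp` alongside the main model via the draft-model options, roughly as in the sketch below. The file names, quant choices and draft parameters are only assumptions, and flag spellings vary between `llama.cpp` versions.

```sh
# Illustrative llama.cpp launch with speculative decoding (adjust paths, quant
# types and draft parameters to your own setup; DeepSeek-R1 itself is a very
# large model and will normally be split/offloaded, which is omitted here).
./llama-server \
    --model DeepSeek-R1-Q4_K_M.gguf \
    --model-draft DeepSeek-R1-DRAFT-0.5B-Q8_0.gguf \
    --gpu-layers-draft 99 \
    --draft-max 16 \
    --draft-min 4 \
    --ctx-size 8192
```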

---

# How the model was created

## 1. The initial model was created from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
python ./transplant_vocab.py \
    Qwen2.5-0.5B-Instruct \
    DeepSeek-R1-BF16 \
    DeepSeek-R1-DRAFT-0.5B-UNTRAINED \
    --trim-hidden-size 768 \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "<tool_call>" \
    --override "<|tool▁call▁begin|>" "<tool_call>" \
    --override "<|tool▁outputs▁begin|>" "<tool_call>" \
    --override "<|tool▁output▁begin|>" "<tool_call>" \
    --override "<|tool▁calls▁end|>" "</tool_call>" \
    --override "<|tool▁call▁end|>" "</tool_call>" \
    --override "<|tool▁outputs▁end|>" "</tool_call>" \
    --override "<|tool▁output▁end|>" "</tool_call>" \
    --override "<|tool▁sep|>" "</tool_call>"
```

**NOTE**: The reason for trimming the hidden-size to 768 (and the number of heads to 12) is so we can use the more advanced GGUF quants. After fine-tuning, the difference in `top-1` eval was only around 2% (71% vs 73%), and [this small gain is then lost](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF) by being forced to use `Q4_0`, which has a ***much*** higher PPL.

**NOTE**: I also tried trimming the hidden-size to 512 (and the heads to 8), but the `top-1` eval was significantly lower (63%).
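
For reference, a checkpoint like this can be converted to GGUF and quantized along the lines of the sketch below (using current `llama.cpp` script/binary names; these are not the exact commands used for the linked GGUF repo). The 768 hidden-size presumably matters here because the k-quant block size of 256 divides 768 but not Qwen2.5-0.5B's original 896.

```sh
# Rough GGUF conversion/quantization sketch (illustrative only).
python convert_hf_to_gguf.py DeepSeek-R1-DRAFT-0.5B \
    --outtype bf16 \
    --outfile DeepSeek-R1-DRAFT-0.5B-BF16.gguf

# Q4_0 minimizes size but (as noted above) has much higher PPL, so Q8_0 or a
# k-quant is likely the better choice for such a tiny draft model.
./llama-quantize DeepSeek-R1-DRAFT-0.5B-BF16.gguf DeepSeek-R1-DRAFT-0.5B-Q8_0.gguf Q8_0
```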

## 2. The following datasets were merged to create a fine-tuning dataset of ~5B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [open-thoughts/OpenThoughts-Unverified-173k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-Unverified-173k)
- [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1) *(300k reasoning samples from DeepSeek-R1 only)*

**NOTE**: The first two datasets were formatted simply as text between `<|end▁of▁sentence|>` tags, and the last two datasets using the proper `deepseek-r1` Jinja template (with `<think>` tags added around the reasoning, etc).
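
To make the two formats concrete, the sketch below shows roughly what one sample of each type might look like. The special-token spellings follow the DeepSeek-R1 tokenizer, but the exact layout is an assumption rather than a copy of the actual preprocessing.

```sh
# Illustrative samples only - not the actual preprocessing pipeline.
cat << 'EOF' > sample_raw.txt
<|end▁of▁sentence|>A chunk of common-crawl prose or stack-smol-xl source code goes here.<|end▁of▁sentence|>
EOF

cat << 'EOF' > sample_chat.txt
<|begin▁of▁sentence|><|User|>How many primes are there below 20?<|Assistant|><think>
2, 3, 5, 7, 11, 13, 17 and 19 - that is 8 of them.
</think>There are 8 primes below 20.<|end▁of▁sentence|>
EOF
```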

This mix of data was chosen based on the ideas presented in [FastDraft: How to Train Your Draft](https://arxiv.org/abs/2411.11055v1). My first attempt at this did not include the raw-code data from `bigcode/the-stack-smol-xl` and did not perform as well as a result. This confirms their findings:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/Uwk_KZtsnP9qyabcLEeEM.png)

## 3. The model was then trained using [qlora-pipe](https://github.com/tdrussell/qlora-pipe) for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

```toml
# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-R1-DRAFT-0.5B-UNTRAINED'
output_dir = 'DeepSeek-R1-DRAFT-0.5B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 1e-4
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
```

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 20,
  "gradient_clipping": 1.0,
  "steps_per_print": 1
}
```

I used six `RTX A6000` GPUs across three nodes, hence the batch size of `120` (6 GPUs × 20 gradient-accumulation steps × a micro-batch size of 1 = 120 sequences per step).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/KueEsBUdLCG17bN2qiEdB.png)

As you can see, `5B` tokens was overkill and around `1-1.5B` would have been enough (the 8-headed `0.33B` model needed at least `2-3B` tokens to recover performance, though).