---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- open-thoughts/OpenThoughts-Unverified-173k
- cognitivecomputations/dolphin-r1
tags:
- draft
- speculative-decoding
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

![image-3.webp](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/pqAVNCYd1BV2ljTFwO9Ab.webp)

A `0.5B` parameter draft (speculative decoding) model for use with [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).

See [jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0-GGUF) for the models in GGUF format.
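
For example, the GGUF files can be loaded as a draft model with llama.cpp's speculative decoding support. A minimal sketch only: the file names below are placeholders, and the draft-related flag names vary between llama.cpp releases, so check `llama-server --help` for your build.

```sh
# Serve the full model together with the small draft model for speculative decoding.
# -md/--model-draft selects the draft; --draft-max/--draft-min control how many
# tokens are drafted per step (availability and defaults depend on the llama.cpp version).
./llama-server \
    -m  DeepSeek-V3-0324-Q4_K_M.gguf \
    -md DeepSeek-V3-0324-DRAFT-0.5B-v1.0-Q4_0.gguf \
    --draft-max 16 --draft-min 5
```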

---

# How the model was created

## 1. The initial model was created from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
python ./transplant_vocab.py \
    Qwen2.5-0.5B-Instruct \
    DeepSeek-V3-0324-BF16 \
    DeepSeek-V3-0324-DRAFT-0.5B-UNTRAINED \
    --trim-hidden-size 768 \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "<tool_call>" \
    --override "<|tool▁call▁begin|>" "<tool_call>" \
    --override "<|tool▁outputs▁begin|>" "<tool_call>" \
    --override "<|tool▁output▁begin|>" "<tool_call>" \
    --override "<|tool▁calls▁end|>" "</tool_call>" \
    --override "<|tool▁call▁end|>" "</tool_call>" \
    --override "<|tool▁outputs▁end|>" "</tool_call>" \
    --override "<|tool▁output▁end|>" "</tool_call>" \
    --override "<|tool▁sep|>" "</tool_call>"
```

**NOTE**: The hidden size is trimmed to 768 (and the number of heads to 12) so that the more advanced GGUF quants can be used: the k-quants work on 256-element blocks, and the original hidden size of 896 is not a multiple of 256, while 768 is. After fine-tuning, the difference in `top-1` eval was only around 2% (71% vs 73%), and [this small gain is then lost](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF) by being forced to use `Q4_0`, which has a ***much*** higher PPL.

**NOTE**: I also tried trimming the hidden size to 512 (and the heads to 8), but the `top-1` eval was significantly lower (63%).
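
As an illustration of why the trim matters, here is a hedged sketch of producing the GGUF quants with llama.cpp. Paths are placeholders, and the script and binary names are those used by recent llama.cpp builds, so older releases may differ.

```sh
# Convert the trained draft model to GGUF, then quantize it.
python convert_hf_to_gguf.py DeepSeek-V3-0324-DRAFT-0.5B \
    --outtype f16 --outfile DeepSeek-V3-0324-DRAFT-0.5B-F16.gguf

# With the hidden size trimmed to 768 (a multiple of 256) the k-quants are usable;
# at the original 896 the quantizer falls back to legacy quants such as Q4_0.
./llama-quantize DeepSeek-V3-0324-DRAFT-0.5B-F16.gguf \
    DeepSeek-V3-0324-DRAFT-0.5B-Q4_K_M.gguf Q4_K_M
```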
68
+
69
+ ## 2. The following datasets were merged to create a fine-tuning dataset of ~5B tokens:
70
+
71
+ - [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
72
+ - [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
73
+ - [open-thoughts/OpenThoughts-Unverified-173k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-Unverified-173k)
74
+ - [https://huggingface.co/datasets/cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1) *(300k reasoning samples from DeepSeek-R1 only)*
75
+
76
+ **NOTE**: The first two datasets were formatted just between `<|end▁of▁sentence|>` tags, and the second two datasets using the proper `deepseek-v3` Jinga template.
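
If you want to rebuild a similar mix, the raw datasets can be fetched with the Hugging Face CLI. A sketch only: the local directory names are arbitrary, and the actual mixing/formatting step is not shown here.

```sh
# Download the four source datasets (requires the huggingface_hub CLI).
huggingface-cli download agentlans/common-crawl-sample --repo-type dataset --local-dir data/common-crawl-sample
huggingface-cli download bigcode/the-stack-smol-xl --repo-type dataset --local-dir data/the-stack-smol-xl
huggingface-cli download open-thoughts/OpenThoughts-Unverified-173k --repo-type dataset --local-dir data/openthoughts
huggingface-cli download cognitivecomputations/dolphin-r1 --repo-type dataset --local-dir data/dolphin-r1
```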

## 3. The model was then trained using [qlora-pipe](https://github.com/tdrussell/qlora-pipe) for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

```toml
# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-V3-0324-DRAFT-0.5B-UNTRAINED'
output_dir = 'DeepSeek-V3-0324-DRAFT-0.5B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 1e-4
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
```

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 20,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```
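
For reference, qlora-pipe runs under the DeepSpeed launcher, with the TOML file above as the training config and the JSON as the DeepSpeed config. The sketch below is hedged: the hostfile layout is standard DeepSpeed, but the `train.py` argument names are assumptions, so check the qlora-pipe README for the exact invocation.

```sh
# hostfile: one "<hostname> slots=<gpus>" line per node (standard DeepSpeed format).
# The argument names below (--config, --deepspeed_config) are assumptions.
deepspeed --hostfile=hostfile train.py \
    --deepspeed \
    --deepspeed_config ds_config.json \
    --config config.toml
```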

I used six `RTX A6000` GPUs over three nodes, hence the `120` batch size (`6 GPUs x 20 gradient accumulation steps = 120`).
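
For clarity, here is the arithmetic behind the batch size and the `~4M tokens per step` figure quoted above, using only the numbers from the configs:

```sh
echo $((6 * 20 * 1))    # 6 GPUs x 20 accumulation steps x micro-batch 1 = 120 sequences per step
echo $((120 * 32768))   # 120 sequences x 32768-token sequence length = 3932160 (~4M) tokens per step
```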