---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- open-thoughts/OpenThoughts-Unverified-173k
- cognitivecomputations/dolphin-r1
tags:
- draft
- speculative-decoding
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

![image-3.webp](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/pqAVNCYd1BV2ljTFwO9Ab.webp)

A `0.5B` parameter draft (speculative decoding) model for use with [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324).

See [jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0-GGUF) for the models in GGUF format.
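
For example, the GGUF files can be loaded as a draft model with llama.cpp's speculative decoding support. A minimal sketch only: the file names below are placeholders, and the draft-related flag names vary between llama.cpp releases, so check `llama-server --help` for your build.

```sh
# Serve the full model together with the small draft model for speculative decoding.
# -md/--model-draft selects the draft; --draft-max/--draft-min control how many
# tokens are drafted per step (availability and defaults depend on the llama.cpp version).
./llama-server \
    -m  DeepSeek-V3-0324-Q4_K_M.gguf \
    -md DeepSeek-V3-0324-DRAFT-0.5B-v1.0-Q4_0.gguf \
    --draft-max 16 --draft-min 5
```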

---

# How the model was created

## 1. The initial model was created from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
python ./transplant_vocab.py \
    Qwen2.5-0.5B-Instruct \
    DeepSeek-V3-0324-BF16 \
    DeepSeek-V3-0324-DRAFT-0.5B-UNTRAINED \
    --trim-hidden-size 768 \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "<tool_call>" \
    --override "<|tool▁call▁begin|>" "<tool_call>" \
    --override "<|tool▁outputs▁begin|>" "<tool_call>" \
    --override "<|tool▁output▁begin|>" "<tool_call>" \
    --override "<|tool▁calls▁end|>" "</tool_call>" \
    --override "<|tool▁call▁end|>" "</tool_call>" \
    --override "<|tool▁outputs▁end|>" "</tool_call>" \
    --override "<|tool▁output▁end|>" "</tool_call>" \
    --override "<|tool▁sep|>" "</tool_call>"
```

**NOTE**: The hidden size is trimmed to 768 (and the number of heads to 12) so that the more advanced GGUF quants can be used: the k-quants work on 256-element blocks, and the original hidden size of 896 is not a multiple of 256, while 768 is. After fine-tuning, the difference in `top-1` eval was only around 2% (71% vs 73%), and [this small gain is then lost](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF) by being forced to use `Q4_0`, which has a ***much*** higher PPL.

**NOTE**: I also tried trimming the hidden size to 512 (and the heads to 8), but the `top-1` eval was significantly lower (63%).
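
As an illustration of why the trim matters, here is a hedged sketch of producing the GGUF quants with llama.cpp. Paths are placeholders, and the script and binary names are those used by recent llama.cpp builds, so older releases may differ.

```sh
# Convert the trained draft model to GGUF, then quantize it.
python convert_hf_to_gguf.py DeepSeek-V3-0324-DRAFT-0.5B \
    --outtype f16 --outfile DeepSeek-V3-0324-DRAFT-0.5B-F16.gguf

# With the hidden size trimmed to 768 (a multiple of 256) the k-quants are usable;
# at the original 896 the quantizer falls back to legacy quants such as Q4_0.
./llama-quantize DeepSeek-V3-0324-DRAFT-0.5B-F16.gguf \
    DeepSeek-V3-0324-DRAFT-0.5B-Q4_K_M.gguf Q4_K_M
```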
68
+
69
+ ## 2. The following datasets were merged to create a fine-tuning dataset of ~5B tokens:
70
+
71
+ - [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
72
+ - [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
73
+ - [open-thoughts/OpenThoughts-Unverified-173k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-Unverified-173k)
74
+ - [https://huggingface.co/datasets/cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1) *(300k reasoning samples from DeepSeek-R1 only)*
75
+
76
+ **NOTE**: The first two datasets were formatted just between `<|end▁of▁sentence|>` tags, and the second two datasets using the proper `deepseek-v3` Jinga template.
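
If you want to rebuild a similar mix, the raw datasets can be fetched with the Hugging Face CLI. A sketch only: the local directory names are arbitrary, and the actual mixing/formatting step is not shown here.

```sh
# Download the four source datasets (requires the huggingface_hub CLI).
huggingface-cli download agentlans/common-crawl-sample --repo-type dataset --local-dir data/common-crawl-sample
huggingface-cli download bigcode/the-stack-smol-xl --repo-type dataset --local-dir data/the-stack-smol-xl
huggingface-cli download open-thoughts/OpenThoughts-Unverified-173k --repo-type dataset --local-dir data/openthoughts
huggingface-cli download cognitivecomputations/dolphin-r1 --repo-type dataset --local-dir data/dolphin-r1
```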

## 3. The model was then trained using [qlora-pipe](https://github.com/tdrussell/qlora-pipe) for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

```toml
# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-V3-0324-DRAFT-0.5B-UNTRAINED'
output_dir = 'DeepSeek-V3-0324-DRAFT-0.5B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 1e-4
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
```

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 20,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```
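
For reference, qlora-pipe runs under the DeepSpeed launcher, with the TOML file above as the training config and the JSON as the DeepSpeed config. The sketch below is hedged: the hostfile layout is standard DeepSpeed, but the `train.py` argument names are assumptions, so check the qlora-pipe README for the exact invocation.

```sh
# hostfile: one "<hostname> slots=<gpus>" line per node (standard DeepSpeed format).
# The argument names below (--config, --deepspeed_config) are assumptions.
deepspeed --hostfile=hostfile train.py \
    --deepspeed \
    --deepspeed_config ds_config.json \
    --config config.toml
```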

I used six `RTX A6000` GPUs over three nodes, hence the `120` batch size (`6 GPUs x 20 gradient accumulation steps = 120`).
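
For clarity, here is the arithmetic behind the batch size and the `~4M tokens per step` figure quoted above, using only the numbers from the configs:

```sh
echo $((6 * 20 * 1))    # 6 GPUs x 20 accumulation steps x micro-batch 1 = 120 sequences per step
echo $((120 * 32768))   # 120 sequences x 32768-token sequence length = 3932160 (~4M) tokens per step
```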