lbourdois committed
Commit 97acbe0 · verified · 1 Parent(s): 38e6b19

Improve language tag


Hi! As the model is multilingual, this is a PR to add languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add these 13 languages.

Files changed (1)
  1. README.md +144 -130
README.md CHANGED
@@ -1,131 +1,145 @@
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- open-thoughts/OpenThoughts-Unverified-173k
- cognitivecomputations/dolphin-r1
tags:
- draft
- speculative-decoding
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
---

![russian dolls.webp](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/hAb6qi-c0wt4wA5pl4Qup.webp)

A `0.5B` parameter draft (speculative decoding) model for use with [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).

**NOTE**: This is a draft model for the **full-sized** `DeepSeek-R1` model and not the smaller "distilled" models!

See [jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF) for the models in GGUF format.
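
For illustration, a GGUF of this draft model is typically passed to `llama.cpp` alongside the main model via the draft-model options, roughly as in the sketch below. The file names, quant choices and draft parameters are only assumptions, and flag spellings vary between `llama.cpp` versions.

```sh
# Illustrative llama.cpp launch with speculative decoding (adjust paths, quant
# types and draft parameters to your own setup; DeepSeek-R1 itself is a very
# large model and will normally be split/offloaded, which is omitted here).
./llama-server \
    --model DeepSeek-R1-Q4_K_M.gguf \
    --model-draft DeepSeek-R1-DRAFT-0.5B-Q8_0.gguf \
    --gpu-layers-draft 99 \
    --draft-max 16 \
    --draft-min 4 \
    --ctx-size 8192
```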

---

# How the model was created

## 1. The initial model was created from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
python ./transplant_vocab.py \
    Qwen2.5-0.5B-Instruct \
    DeepSeek-R1-BF16 \
    DeepSeek-R1-DRAFT-0.5B-UNTRAINED \
    --trim-hidden-size 768 \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "<tool_call>" \
    --override "<|tool▁call▁begin|>" "<tool_call>" \
    --override "<|tool▁outputs▁begin|>" "<tool_call>" \
    --override "<|tool▁output▁begin|>" "<tool_call>" \
    --override "<|tool▁calls▁end|>" "</tool_call>" \
    --override "<|tool▁call▁end|>" "</tool_call>" \
    --override "<|tool▁outputs▁end|>" "</tool_call>" \
    --override "<|tool▁output▁end|>" "</tool_call>" \
    --override "<|tool▁sep|>" "</tool_call>"
```

**NOTE**: The reason for trimming the hidden-size to 768 (and the number of heads to 12) is so we can use the more advanced GGUF quants. After fine-tuning, the difference in `top-1` eval was only around 2% (71% vs 73%), and [this small gain is then lost](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF) by being forced to use `Q4_0`, which has a ***much*** higher PPL.

**NOTE**: I also tried trimming the hidden-size to 512 (and the heads to 8), but the `top-1` eval was significantly lower (63%).
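
For reference, a checkpoint like this can be converted to GGUF and quantized along the lines of the sketch below (using current `llama.cpp` script/binary names; these are not the exact commands used for the linked GGUF repo). The 768 hidden-size presumably matters here because the k-quant block size of 256 divides 768 but not Qwen2.5-0.5B's original 896.

```sh
# Rough GGUF conversion/quantization sketch (illustrative only).
python convert_hf_to_gguf.py DeepSeek-R1-DRAFT-0.5B \
    --outtype bf16 \
    --outfile DeepSeek-R1-DRAFT-0.5B-BF16.gguf

# Q4_0 minimizes size but (as noted above) has much higher PPL, so Q8_0 or a
# k-quant is likely the better choice for such a tiny draft model.
./llama-quantize DeepSeek-R1-DRAFT-0.5B-BF16.gguf DeepSeek-R1-DRAFT-0.5B-Q8_0.gguf Q8_0
```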

## 2. The following datasets were merged to create a fine-tuning dataset of ~5B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [open-thoughts/OpenThoughts-Unverified-173k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-Unverified-173k)
- [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1) *(300k reasoning samples from DeepSeek-R1 only)*

**NOTE**: The first two datasets were formatted simply as text between `<|end▁of▁sentence|>` tags, and the last two datasets using the proper `deepseek-r1` Jinja template (with `<think>` tags added around the reasoning, etc).
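
To make the two formats concrete, the sketch below shows roughly what one sample of each type might look like. The special-token spellings follow the DeepSeek-R1 tokenizer, but the exact layout is an assumption rather than a copy of the actual preprocessing.

```sh
# Illustrative samples only - not the actual preprocessing pipeline.
cat << 'EOF' > sample_raw.txt
<|end▁of▁sentence|>A chunk of common-crawl prose or stack-smol-xl source code goes here.<|end▁of▁sentence|>
EOF

cat << 'EOF' > sample_chat.txt
<|begin▁of▁sentence|><|User|>How many primes are there below 20?<|Assistant|><think>
2, 3, 5, 7, 11, 13, 17 and 19 - that is 8 of them.
</think>There are 8 primes below 20.<|end▁of▁sentence|>
EOF
```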

This mix of data was chosen based on the ideas presented in [FastDraft: How to Train Your Draft](https://arxiv.org/abs/2411.11055v1). My first attempt at this did not include the raw-code data from `bigcode/the-stack-smol-xl` and did not perform as well as a result. This confirms their findings:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/Uwk_KZtsnP9qyabcLEeEM.png)

## 3. The model was then trained using [qlora-pipe](https://github.com/tdrussell/qlora-pipe) for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

```toml
# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-R1-DRAFT-0.5B-UNTRAINED'
output_dir = 'DeepSeek-R1-DRAFT-0.5B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 1e-4
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
```

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 20,
  "gradient_clipping": 1.0,
  "steps_per_print": 1
}
```

I used six `RTX A6000` GPUs across three nodes, hence the batch size of `120` (6 GPUs × 20 gradient-accumulation steps × a micro-batch size of 1 = 120 sequences per step).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/KueEsBUdLCG17bN2qiEdB.png)

As you can see, `5B` tokens was overkill and around `1-1.5B` would have been enough (the 8-headed `0.33B` model needed at least `2-3B` tokens to recover performance, though).