---
datasets:
- PowerInfer/QWQ-LONGCOT-500K
- PowerInfer/LONGCOT-Refine-500K
base_model:
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
library_name: transformers
---
# SmallThinker-3B-preview

We introduce **SmallThinker-3B-preview**, a new model fine-tuned from [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).

You can now deploy SmallThinker directly on your phone with [PowerServe](https://github.com/powerserve-project/PowerServe).

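For quick reference, here is a minimal inference sketch using `transformers` (the library listed in this card). The repository id `PowerInfer/SmallThinker-3B-Preview` is an assumption rather than something stated in this card; replace it with the model's actual Hub path.

```python
# Minimal inference sketch for SmallThinker with transformers.
# NOTE: the repo id below is an assumption; substitute the actual Hub path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PowerInfer/SmallThinker-3B-Preview"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Long chain-of-thought models need generous token budgets.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
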
## Benchmark Performance

| Model               | AIME24 | AMC23 | GAOKAO2024_I | GAOKAO2024_II | MMLU_STEM | AMPS_Hard | math_comp |
|---------------------|--------|-------|--------------|---------------|-----------|-----------|-----------|
| Qwen2.5-3B-Instruct | 6.67   | 45    | 50           | 35.8          | 59.8      | -         | -         |
| SmallThinker        | 16.667 | 57.5  | 64.2         | 57.1          | 68.2      | 70        | 46.8      |
| GPT-4o              | 9.3    | -     | -            | -             | 64.2      | 57        | 50        |

Limitation: Because SmallThinker's instruction following is still limited, for math_comp we use a more lenient evaluation: only a correct answer is required, and responses are not constrained to the specified AAAAA format.

Colab link: [open in Colab](https://colab.research.google.com/drive/182q600at0sVw7uX0SXFp6bQI7pyjWXQ2?usp=sharing)

## Intended Use Cases

SmallThinker is designed for the following use cases:

1. **Edge Deployment:** Its small size makes it ideal for deployment on resource-constrained devices.
2. **Draft Model for QwQ-32B-Preview:** SmallThinker can serve as a fast and efficient draft model for the larger QwQ-32B-Preview model. In our llama.cpp tests, speculative decoding with SmallThinker as the draft model raised throughput from about 40 tokens/s to about 70 tokens/s (roughly a 75% speedup); a `transformers` sketch of the same idea follows this list.

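The 40 → 70 tokens/s figure above was measured with llama.cpp. As a rough illustration of the same draft-model idea in `transformers` (assisted generation, not the llama.cpp setup used for that measurement), the sketch below passes SmallThinker as the `assistant_model` for QwQ-32B-Preview. The SmallThinker repo id is an assumption, and this relies on both models sharing the Qwen2.5 tokenizer.

```python
# Sketch: speculative ("assisted") decoding in transformers, with SmallThinker
# drafting tokens that QwQ-32B-Preview verifies. The SmallThinker repo id is an
# assumption; throughput will differ from the llama.cpp numbers above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "Qwen/QwQ-32B-Preview"
draft_id = "PowerInfer/SmallThinker-3B-Preview"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How many positive divisors does 360 have?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(target.device)

# assistant_model enables assisted generation: the small model proposes tokens,
# the large model verifies them in parallel, so outputs still match the target.
outputs = target.generate(inputs, assistant_model=draft, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
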
## Training Details

The model was trained on 8 H100 GPUs with a global batch size of 16. The SFT (Supervised Fine-Tuning) process was conducted in two phases, with the full configuration for each phase shown below:

1. First Phase:
- Used only the PowerInfer/QWQ-LONGCOT-500K dataset
- Trained for 1.5 epochs
```
### model
model_name_or_path: /home/syx/Qwen2.5-3B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: o1-v2
template: qwen
neat_packing: true
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2-01-qat/full/sft
logging_steps: 1
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
```
2. Second Phase:
- Combined training with the PowerInfer/QWQ-LONGCOT-500K and PowerInfer/LONGCOT-Refine-500K datasets
- Continued training for 2 additional epochs
```
### model
model_name_or_path: saves/qwen2-01-qat/full/sft/checkpoint-24000

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: o1-v2, o1-v3
template: qwen
neat_packing: true
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2-01-qat/full/sft
logging_steps: 1
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
```

## Limitations & Disclaimer

Please be aware of the following limitations:

* **Language Limitation:** The model has only been trained on English-language datasets, so its capabilities in other languages are still lacking.
* **Limited Knowledge:** Due to the limited SFT data and the model's relatively small scale, its reasoning capabilities are constrained by its knowledge base.
* **Unpredictable Outputs:** The model may produce unexpected outputs due to its size and probabilistic generation paradigm. Users should exercise caution and validate its responses.
* **Repetition Issue:** The model tends to repeat itself when answering high-difficulty questions. Increasing the `repetition_penalty` mitigates this issue; see the sketch after this list.
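
As a hypothetical illustration of that suggestion, the snippet below reuses `model`, `tokenizer`, and `inputs` from the quickstart sketch near the top of this card; the value 1.2 is illustrative, not a tuned recommendation from the authors.

```python
# Illustrative only: raise repetition_penalty when answers start looping.
# Reuses model/tokenizer/inputs from the quickstart sketch above.
outputs = model.generate(
    inputs,
    max_new_tokens=4096,       # long chain-of-thought answers can be lengthy
    repetition_penalty=1.2,    # >1.0 down-weights tokens that already appeared
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```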