quwsarohi committed ecc16a5 (verified) · parent: 824b7cf

Update README.md

Files changed (1): README.md (+49 −3)
- R1
- CoT
---

# SmolThink: A Small Model That Tries to Think

**SmolThink** is a continued supervised fine-tuned (SFT) version of [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct), trained on a **DeepSeek-R1**-distilled dataset.

The training code and a portion of the dataset can be found at [QuwsarOhi/SmolThink](https://github.com/QuwsarOhi/SmolThink).

## Training Process

The model was trained on a mixture of mostly short Chain-of-Thought (CoT) data and some long-CoT data. A short-CoT mixture was chosen because the model is small, and it has been reported that small models struggle to produce long reasoning chains ([ref](https://arxiv.org/abs/2502.12143)).

The SFT dataset was created from the following data mixture:

* [GeneralReasoning/GeneralThought-195K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K)
* [open-r1/codeforces-cots](https://huggingface.co/datasets/open-r1/codeforces-cots)
* [XeTute/Open-Coding-Thoughts (currently unavailable)](https://huggingface.co/datasets/XeTute/Open-Coding-Thoughts)
* A custom tool-calling and web-search summarization dataset generated with **Phi-3.5** and the **Qwen2**-based **deepseek-r1:7b**. The data generation pipeline is described [here](#).

The datasets were filtered by removing samples whose CoT was longer than 256 words. The model was also trained to produce tool calls: being a small language model, it cannot memorize many facts (for example, how to bake a cake), but when paired with web search it can produce good-quality answers despite its size.
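
As a rough illustration, the length filter amounts to a simple word count on the CoT text (a minimal sketch; the field name `cot` and the use of the `datasets` library are assumptions, not the repo's actual code):

```python
MAX_COT_WORDS = 256

def keep_sample(example: dict) -> bool:
    """Keep only samples whose chain-of-thought is at most MAX_COT_WORDS words long."""
    cot = example.get("cot", "")  # assumed field name for the reasoning trace
    return len(cot.split()) <= MAX_COT_WORDS

# With a Hugging Face `datasets` dataset this could be applied as:
# filtered = dataset.filter(keep_sample)
```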

The model was supervised fine-tuned in two phases, each using a rolling (strided) context window over the tokenized data (see the sketch below):

* It was first trained with a rolling context length of `832` and a token stride of `832/8` (104 tokens). The dataset snapshot is available as [merged_dataset_phase1](#).
* It was then trained with a rolling context length of `3072` and a token stride of `768`. The dataset snapshot is available as [merged_dataset_phase2](#).

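The rolling-window chunking described above can be sketched as follows (a minimal illustration assuming a flat stream of token ids; the actual training pipeline may differ):

```python
def rolling_chunks(token_ids: list[int], context_len: int, stride: int) -> list[list[int]]:
    """Split a long token stream into overlapping windows of `context_len` tokens,
    advancing the window start by `stride` tokens each step."""
    last_start = max(len(token_ids) - context_len, 0)
    return [token_ids[s:s + context_len] for s in range(0, last_start + 1, stride)]

# Phase 1: context length 832, stride 832 // 8 = 104 tokens
# phase1_chunks = rolling_chunks(ids, context_len=832, stride=104)
# Phase 2: context length 3072, stride 768 tokens
# phase2_chunks = rolling_chunks(ids, context_len=3072, stride=768)
```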

The model is still under training; the full training procedure and dataset mixtures will be published soon. Training is done on a MacBook Air with 16 GB of unified memory.

## Usage

### General Usage

Use the following code to load and use the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# The checkpoint/device setup lines are not shown in this excerpt; the lines
# below are an assumed reconstruction (adjust the repo id to this model card's).
device = "mps" if torch.backends.mps.is_available() else "cpu"
checkpoint = "quwsarohi/SmolThink"  # assumption: replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is the capital of France?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(input_text)

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
```
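
Because the model is trained to produce a chain of thought before answering, in practice you will likely want a larger `max_new_tokens` budget (for example, 512 or more) than the short demo value above.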

### WebSearch Tool Integration

The model is further trained to perform web search through a special `web_search` tool. The following code shows how to invoke this capability:

```python
webtool_def = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Can search the web for information which is doubtful/unknown/recent.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_str": {
                    "type": "string",
                    "description": "The whole question you want to ask.",
                    "required": True,
                }
            },
        },
    },
}

base_prompt = tokenizer.apply_chat_template([
    {"role": "user", "content": "What is the current stock price of Apple?"}
], tools=[webtool_def], tokenize=False, add_generation_prompt=True)
print(base_prompt)

inputs = tokenizer.encode(base_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
```
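
After generation, the emitted tool call has to be parsed and executed by the caller. The exact output format is not shown here; the sketch below assumes the model emits a JSON object such as `{"name": "web_search", "arguments": {"search_str": "..."}}` and uses a hypothetical `run_web_search` helper in place of a real search backend:

```python
import json
import re

def run_web_search(search_str: str) -> str:
    """Hypothetical helper: call a real search backend and return a text summary."""
    return f"(search results for: {search_str})"

# Decode only the newly generated tokens
generated = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Assumption: the tool call appears as a single JSON object in the generated text
match = re.search(r"\{.*\}", generated, re.DOTALL)
if match:
    call = json.loads(match.group(0))
    if call.get("name") == "web_search":
        search_result = run_web_search(call["arguments"]["search_str"])
        print(search_result)
```

The returned search result would then be appended to the conversation as a new turn (formatted via the tokenizer's chat template) so the model can summarize it into a final answer.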