---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- cogito-v1-preview-llama-3B
---
Quantizations of https://huggingface.co/deepcogito/cogito-v1-preview-llama-3B

### Open source inference clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [jan](https://github.com/janhq/jan)
* [GPT4All](https://github.com/nomic-ai/gpt4all)

### Closed source inference clients/UIs
* [LM Studio](https://lmstudio.ai/)
* [Backyard AI](https://backyard.ai/)
* More will be added...
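
For example, a single quantized file can be fetched with `huggingface_hub` and then opened in any of the clients above. This is only a sketch: the `repo_id` and `filename` below are placeholders, so substitute this repository's actual id and one of the `.gguf` files it contains.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id/filename -- replace with this repository's id and a .gguf file it hosts.
gguf_path = hf_hub_download(
    repo_id="duyntnet/cogito-v1-preview-llama-3B-GGUF",  # hypothetical repo id
    filename="cogito-v1-preview-llama-3B-Q4_K_M.gguf",   # hypothetical quant file
)
print(gguf_path)  # local path that llama.cpp-based clients can load
```
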
---

# From original readme

The Cogito LLMs are instruction-tuned generative models (text in/text out). All models are released under an open license for commercial use.

- Cogito models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models).
- The LLMs are trained using **Iterated Distillation and Amplification (IDA)** - a scalable and efficient alignment strategy for superintelligence using iterative self-improvement.
- The models have been optimized for coding, STEM, instruction following and general helpfulness, and have significantly higher multilingual, coding and tool calling capabilities than size-equivalent counterparts.
- In both standard and reasoning modes, Cogito v1-preview models outperform their size-equivalent counterparts on common industry benchmarks.
- Each model is trained in over 30 languages and supports a context length of 128k.

# Usage
Here is a snippet for usage with Transformers:

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-llama-3B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```

## Implementing extended thinking
- By default, the model answers in standard mode.
- To enable thinking, use either of the following two methods:
  - Add a specific system prompt, or
  - Set `enable_thinking=True` while applying the chat template.

> **_NOTE:_** For the Cogito 3B model, we suggest using `repetition_penalty=1.1` while implementing extended thinking.
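
A minimal sketch of where that suggested penalty can be passed, reusing the `pipeline` object and `messages` from the Usage snippet above (generation kwargs given to the text-generation pipeline are forwarded to `generate()`; combine this with one of the two methods below to actually enable thinking):

```python
# Generation call with the suggested repetition penalty for the 3B model.
outputs = pipeline(
    messages,
    max_new_tokens=512,
    repetition_penalty=1.1,  # suggested value from the note above
)
print(outputs[0]["generated_text"][-1])
```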

### Method 1 - Add a specific system prompt.
To enable thinking, simply use this in the system prompt: `system_instruction = 'Enable deep thinking subroutine.'`

If you already have a system_instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.

Here is an example -

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-llama-3B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```

Similarly, if you have a system prompt, prepend `DEEP_THINKING_INSTRUCTION` to it in this way -

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```

### Method 2 - Set enable_thinking=True in the tokenizer
If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` to the tokenization (this option is added to the chat template).

Here is an example -
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v1-preview-llama-3B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
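
For comparison, the default standard mode uses the same call with `enable_thinking` simply omitted; a minimal sketch reusing `tokenizer` and `messages` from above:

```python
# Standard (non-thinking) mode: leave out enable_thinking when applying the chat template.
text_standard = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
```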

# Tool Calling
Cogito models support tool calling (single, parallel, multiple and parallel_multiple) in both standard and extended thinking modes.

Here is a snippet -

```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
# (reuses the `model` and `tokenizer` loaded in the Method 2 example above)
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```

This will result in the output -
```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|eot_id|>
```
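
The emitted `<tool_call>` block is plain JSON between tags, so it can be parsed back into a Python dict before being appended to the chat; a minimal sketch (assumes `output_text` from the snippet above contains exactly one tool call):

```python
import json
import re

# Pull the JSON payload out of <tool_call> ... </tool_call> in the model output.
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
if match:
    tool_call = json.loads(match.group(1))
    # e.g. {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
```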

If the model generates a tool call, as it does above, you should add it to the chat like so:

```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```

and then call the tool and append the result, with the `tool` role, like so:

```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```
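
Equivalently, instead of hard-coding `"22.0"`, the result can come from actually invoking the function with the arguments the model produced; a hypothetical sketch reusing the parsed `tool_call` dict from above:

```python
# Run the tool with the model-supplied arguments and append the stringified result.
result = get_current_temperature(**tool_call["arguments"])
messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})
```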

After that, you can `generate()` again to let the model use the tool result in the chat:

```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```

This should result in the string -
```
'The current temperature in Paris is 22.0 degrees.<|eot_id|>'
```