---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- cogito-v1-preview-llama-8B
---
Quantizations of https://huggingface.co/deepcogito/cogito-v1-preview-llama-8B

### Open source inference clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [jan](https://github.com/janhq/jan)
* [GPT4All](https://github.com/nomic-ai/gpt4all)

### Closed source inference clients/UIs
* [LM Studio](https://lmstudio.ai/)
* [Backyard AI](https://backyard.ai/)
* More will be added...
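
As a quick sanity check of a downloaded quant, here is a minimal sketch using the llama-cpp-python bindings for llama.cpp (not listed above, but one possible way to load a GGUF file from Python). The model filename and settings below are placeholders, not exact files from this repo:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model_path is an assumed placeholder; substitute the quant you downloaded from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="cogito-v1-preview-llama-8B-Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,        # context window; the model itself supports up to 128k
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a short introduction to LLMs."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```
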
---

# From original readme

The Cogito LLMs are instruction tuned generative models (text in/text out). All models are released under an open license for commercial use.

- Cogito models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models).
- The LLMs are trained using **Iterated Distillation and Amplification (IDA)** - a scalable and efficient alignment strategy for superintelligence using iterative self-improvement.
- The models have been optimized for coding, STEM, instruction following, and general helpfulness, and have significantly higher multilingual, coding, and tool calling capabilities than size-equivalent counterparts.
- In both standard and reasoning modes, Cogito v1-preview models outperform their size-equivalent counterparts on common industry benchmarks.
- Each model is trained in over 30 languages and supports a context length of 128k.


# Usage
Here is a snippet for usage with Transformers:
43
+
44
+ ```python
45
+ import transformers
46
+ import torch
47
+
48
+ model_id = "deepcogito/cogito-v1-preview-llama-8B"
49
+
50
+ pipeline = transformers.pipeline(
51
+ "text-generation",
52
+ model=model_id,
53
+ model_kwargs={"torch_dtype": torch.bfloat16},
54
+ device_map="auto",
55
+ )
56
+
57
+ messages = [
58
+ {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
59
+ {"role": "user", "content": "Give me a short introduction to LLMs."},
60
+ ]
61
+
62
+ outputs = pipeline(
63
+ messages,
64
+ max_new_tokens=512,
65
+ )
66
+
67
+ print(outputs[0]["generated_text"][-1])
68
+ ```


## Implementing extended thinking
- By default, the model will answer in the standard mode.
- To enable thinking, you can use either of these two methods:
  - Add a specific system prompt, or
  - Set `enable_thinking=True` while applying the chat template.


### Method 1 - Add a specific system prompt.
To enable thinking, simply use this in the system prompt: `system_instruction = 'Enable deep thinking subroutine.'`

If you already have a system_instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.

Here is an example -

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-llama-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```


Similarly, if you already have a system prompt, you can prepend `DEEP_THINKING_INSTRUCTION` to it in this way -

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```

### Method 2 - Set enable_thinking=True in the tokenizer
If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` to the tokenization (this option is added to the chat template).

Here is an example -
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v1-preview-llama-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

# Tool Calling
Cogito models support tool calling (single, parallel, multiple and parallel_multiple) in both standard and extended thinking modes.

Here is a snippet -

```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

# apply_chat_template can return token ids directly when the tools are passed in...
model_inputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)

# ...but here we render the prompt as text so the generated continuation can be sliced off below
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```

This will result in the output -
```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|eot_id|>
```
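
If you want to execute the call programmatically, here is a minimal sketch (not part of the original readme) that parses the `<tool_call>` block shown above back into a Python dict and dispatches it to the matching function:

```python
import json
import re

# Minimal sketch: extract the JSON payload from the <tool_call> ... </tool_call>
# block produced above and call the corresponding Python function.
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
if match:
    tool_call = json.loads(match.group(1))
    available_tools = {"get_current_temperature": get_current_temperature}
    result = available_tools[tool_call["name"]](**tool_call["arguments"])
    print(result)  # 22.0
```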

If the model generates a tool call, as it did above, you should add it to the chat like so:

```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```

and then call the tool and append the result, with the `tool` role, like so:

```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```

After that, you can `generate()` again to let the model use the tool result in the chat:

```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```

This should result in the string -
```
'The current temperature in Paris is 22.0 degrees.<|eot_id|>'
```