Upload 4 files
- README.md +7 -75
- config.json +2 -1
- special_tokens_map.json +2 -2
- tokenizer_config.json +18 -4
README.md
CHANGED
@@ -5,89 +5,21 @@ OpenSeek-Small-v1 is the initial production model of the OpenSeek project.
 - Utilizes DeepSeek-V3-like MoE architecture.
 - Comprises 1.4 billion total parameters, with 0.4 billion activated parameters.
 - Trained on 720 billion tokens.
-
-| Dataset | Ratio |
-|-------------------------------------------|---------|
-| Nemotron-CC-high-actual-actual-high | 1.26 |
-| Nemotron-CC-high-actual-actual-low | 0.67 |
-| Nemotron-CC-high-actual-actual-mid | 2.05 |
-| Nemotron-CC-high-synthetic-distill-high | 1.59 |
-| Nemotron-CC-high-synthetic-distill-low | 0.64 |
-| Nemotron-CC-high-synthetic-distill-mid | 2.32 |
-| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 4.67 |
-| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 2.16 |
-| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 7.58 |
-| Nemotron-CC-high-synthetic-extract_knowledge-high | 6.43 |
-| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.07 |
-| Nemotron-CC-high-synthetic-extract_knowledge-mid | 2.22 |
-| Nemotron-CC-high-synthetic-knowledge_list-high | 1.88 |
-| Nemotron-CC-high-synthetic-knowledge_list-low | 0.74 |
-| Nemotron-CC-high-synthetic-knowledge_list-mid | 3.20 |
-| Nemotron-CC-high-synthetic-wrap_medium-high | 3.89 |
-| Nemotron-CC-high-synthetic-wrap_medium-low | 0.65 |
-| Nemotron-CC-high-synthetic-wrap_medium-mid | 6.18 |
-| Nemotron-CC-low-synthetic-wrap_medium-high | 0.17 |
-| Nemotron-CC-low-synthetic-wrap_medium-low | 0.30 |
-| Nemotron-CC-low-synthetic-wrap_medium-mid | 1.08 |
-| Nemotron-CC-medium-actual-actual-high | 2.20 |
-| Nemotron-CC-medium-actual-actual-low | 4.48 |
-| Nemotron-CC-medium-actual-actual-mid | 7.76 |
-| arxiv | 0.32 |
-| books | 1.98 |
-| code | 3.43 |
-| cot_synthesis_CC | 9.82 |
-| cot_synthesis_OpenSource | 0.46 |
-| cot_synthesis_arxiv | 4.15 |
-| cot_synthesis_code | 1.32 |
-| cot_synthesis_math | 2.19 |
-| cot_synthesis_wiki | 0.83 |
-| math | 0.83 |
-| pes2o | 0.31 |
-| stack | 0.19 |
-| wiki | 0.29 |
-| zh_cc | 9.65 |
-
-## Wandb
-Our training curves have been recorded in Weights & Biases [wandb](https://wandb.ai/openseek-team/OpenSeek-Small-v1).
-
-## Evaluation
-| Category | Metrics (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
-|------------------------------|-------------------|--------------|--------------|--------------|---------------|-------------------|
-| **English-Commonsense Reasoning** | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
-| | TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 |
-| | Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 |
-| | CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 |
-| | PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 |
-| | OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 |
-| | BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 |
-| **English-Problem-Solving** | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
-| | ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 |
-| | MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 |
-| **English-Mathematics** | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
-| | Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 |
-| **Chinese** | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
-| | CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 |
-| **Average Metrics** | **Average-English(w/o Math)** | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
-| | **Average-English** | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 |
-| | **Average-Chinese** | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 |
-| | **Average** | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 |
-| | **Average(w/o Math)** | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |
-
-OpenSeek-Small-v1 demonstrates superior efficiency compared to 1-billion-parameter models.
-
-<img src="logC_vs_Metric_Average_scatter_plot.png" alt="logC_vs_Metric_Average" width="400"/>
+- Completely broken in stock form.
+
+Key Fixes in this repository:
+- Fixed broken position embeddings.
+- Fixed fundamental incompatibilities between the DeepSeek-V3 model and the Qwen tokenizer.
 
 ## Usage Instructions
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model = AutoModelForCausalLM.from_pretrained("
-tokenizer = AutoTokenizer.from_pretrained("
+model = AutoModelForCausalLM.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)
 
 inputs = tokenizer("The future of AI is", return_tensors="pt")
+inputs.pop("token_type_ids", None)  # Critical fix
 outputs = model.generate(**inputs, max_length=50)
 print(tokenizer.decode(outputs[0]))
 ```
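The `inputs.pop("token_type_ids", None)` line added to the README works around a mismatch between the Qwen-style tokenizer, which emits `token_type_ids`, and the DeepSeek-V3 modeling code, which does not accept that argument during `generate`. A minimal sketch of the same idea that filters the encoded inputs down to the keys the model accepts (the repo id and `trust_remote_code=True` come from the diff above; the key filtering and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Robertp423/OpenSeek-Fixed"  # repo id as given in the updated README
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

enc = tokenizer("The future of AI is", return_tensors="pt")

# Drop tensors the DeepSeek-V3 forward() does not understand; this generalizes
# the single `pop("token_type_ids", None)` fix shown in the README diff.
allowed = {"input_ids", "attention_mask"}
enc = {k: v for k, v in enc.items() if k in allowed}

out = model.generate(**enc, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))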
config.json
CHANGED
@@ -19,7 +19,6 @@
   "initializer_range": 0.006,
   "intermediate_size": 7168,
   "kv_lora_rank": 512,
-  "position_embedding": "rope_64dim",
   "max_position_embeddings": 4096,
   "model_type": "deepseek_v3",
   "moe_intermediate_size": 896,
@@ -34,6 +33,8 @@
   "num_key_value_heads": 10,
   "num_nextn_predict_layers": 1,
   "pretraining_tp": 1,
+  "problematic_params": ["token_type_ids"],
+  "position_embedding_type": "rope",
   "q_lora_rank": null,
   "qk_nope_head_dim": 128,
   "qk_rope_head_dim": 64,
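Neither `problematic_params` nor `position_embedding_type` is a standard `DeepseekV3Config` field, so they should simply ride along as extra attributes when the config is loaded with `trust_remote_code=True`. A quick sanity check that the new keys are visible after loading (repo id taken from the README diff; this only inspects the config and makes no assumption about how the remote modeling code consumes those fields):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)

# Values expected from the config.json shown above.
print(cfg.model_type)                                 # deepseek_v3
print(getattr(cfg, "position_embedding_type", None))  # rope
print(getattr(cfg, "problematic_params", None))       # ['token_type_ids']
print(cfg.qk_rope_head_dim, cfg.max_position_embeddings)  # 64 4096
```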
special_tokens_map.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "bos_token": "<|extra_203|>",
   "eos_token": "<|extra_204|>",
-  "unk_token": "
-  "pad_token": "
+  "unk_token": "<|endoftext|>",
+  "pad_token": "<|endoftext|>"
 }
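With `unk_token` and `pad_token` now mapped to `<|endoftext|>`, batched encoding has a defined pad token to fall back on. A short sanity check, assuming the remote QWenTokenizer honors the updated map (repo id from the README diff; the prompts are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)
print(tok.bos_token, tok.eos_token, tok.unk_token, tok.pad_token)

# Padding a batch only works once pad_token resolves to a real vocabulary entry.
batch = tok(["short prompt", "a noticeably longer prompt"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
```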
tokenizer_config.json
CHANGED
@@ -1,11 +1,25 @@
 {
-  "model_max_length":
+  "model_max_length": 4096,
   "tokenizer_class": "QWenTokenizer",
   "auto_map": {
     "AutoTokenizer": [
       "tokenization_qwen.QWenTokenizer",
       null
-
+    ]
   },
-  "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
-
+  "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "added_tokens_map": {
+    "<|endoftext|>": {
+      "content": "<|endoftext|>",
+      "single_word": false
+    },
+    "<|im_start|>": {
+      "content": "<|im_start|>",
+      "single_word": false
+    },
+    "<|im_end|>": {
+      "content": "<|im_end|>",
+      "single_word": false
+    }
+  }
+}
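The `chat_template` added here is the usual ChatML layout built from the `<|im_start|>` / `<|im_end|>` tokens registered in `added_tokens_map`. Because the template is stored in `tokenizer_config.json`, `apply_chat_template` should pick it up automatically; a hedged usage sketch (the message content is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)

messages = [{"role": "user", "content": "Summarize the fixes in this repository."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Per the template above, the rendered prompt starts with the default system turn:
#   <|im_start|>system
#   You are a helpful assistant<|im_end|>
#   <|im_start|>user
#   ...<|im_end|>
#   <|im_start|>assistant
```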