Robertp423 committed
Commit 2d9cfed · verified · 1 Parent(s): f102bd1

Upload 4 files

Files changed (4)
  1. README.md +7 -75
  2. config.json +2 -1
  3. special_tokens_map.json +2 -2
  4. tokenizer_config.json +18 -4
README.md CHANGED
@@ -5,89 +5,21 @@ OpenSeek-Small-v1 is the initial production model of the OpenSeek project.
  - Utilizes DeepSeek-V3-like MoE architecture.
  - Comprises 1.4 billion total parameters, with 0.4 billion activated parameters.
  - Trained on 720 billion tokens.
- - Demonstrates superior efficiency compared to 1-billion-parameter models.
+ - Completely broken in stock form.

- ## Training Data
- - 0.72 trillion tokens of high-quality pretraining data; the ratio for each domain is as follows:
- | Name | Ratio |
- |-------------------------------------------|---------|
- | Nemotron-CC-high-actual-actual-high | 1.26 |
- | Nemotron-CC-high-actual-actual-low | 0.67 |
- | Nemotron-CC-high-actual-actual-mid | 2.05 |
- | Nemotron-CC-high-synthetic-distill-high | 1.59 |
- | Nemotron-CC-high-synthetic-distill-low | 0.64 |
- | Nemotron-CC-high-synthetic-distill-mid | 2.32 |
- | Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 4.67 |
- | Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 2.16 |
- | Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 7.58 |
- | Nemotron-CC-high-synthetic-extract_knowledge-high | 6.43 |
- | Nemotron-CC-high-synthetic-extract_knowledge-low | 0.07 |
- | Nemotron-CC-high-synthetic-extract_knowledge-mid | 2.22 |
- | Nemotron-CC-high-synthetic-knowledge_list-high | 1.88 |
- | Nemotron-CC-high-synthetic-knowledge_list-low | 0.74 |
- | Nemotron-CC-high-synthetic-knowledge_list-mid | 3.20 |
- | Nemotron-CC-high-synthetic-wrap_medium-high | 3.89 |
- | Nemotron-CC-high-synthetic-wrap_medium-low | 0.65 |
- | Nemotron-CC-high-synthetic-wrap_medium-mid | 6.18 |
- | Nemotron-CC-low-synthetic-wrap_medium-high | 0.17 |
- | Nemotron-CC-low-synthetic-wrap_medium-low | 0.30 |
- | Nemotron-CC-low-synthetic-wrap_medium-mid | 1.08 |
- | Nemotron-CC-medium-actual-actual-high | 2.20 |
- | Nemotron-CC-medium-actual-actual-low | 4.48 |
- | Nemotron-CC-medium-actual-actual-mid | 7.76 |
- | arxiv | 0.32 |
- | books | 1.98 |
- | code | 3.43 |
- | cot_synthesis_CC | 9.82 |
- | cot_synthesis_OpenSource | 0.46 |
- | cot_synthesis_arxiv | 4.15 |
- | cot_synthesis_code | 1.32 |
- | cot_synthesis_math | 2.19 |
- | cot_synthesis_wiki | 0.83 |
- | math | 0.83 |
- | pes2o | 0.31 |
- | stack | 0.19 |
- | wiki | 0.29 |
- | zh_cc | 9.65 |
-
- ## Wandb
- Our training curves have been recorded in Weights & Biases [wandb](https://wandb.ai/openseek-team/OpenSeek-Small-v1).
-
- ## Evaluation
- | Category | Metrics (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
- |------------------------------|-------------------|--------------|--------------|--------------|---------------|-------------------|
- | **English-Commonsense Reasoning** | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
- | | TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 |
- | | Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 |
- | | CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 |
- | | PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 |
- | | OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 |
- | | BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 |
- | **English-Problem-Solving** | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
- | | ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 |
- | | MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 |
- | **English-Mathematics** | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
- | | Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 |
- | **Chinese** | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
- | | CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 |
- | **Average Metrics** | **Average-English (w/o Math)** | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
- | | **Average-English** | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 |
- | | **Average-Chinese** | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 |
- | | **Average** | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 |
- | | **Average (w/o Math)** | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |
-
- OpenSeek-Small-v1 demonstrates superior efficiency compared to 1-billion-parameter models.
-
- <img src="logC_vs_Metric_Average_scatter_plot.png" alt="logC_vs_Metric_Average" width="400"/>
+ Key fixes in this repository:
+ - Fixed broken position embeddings.
+ - Fixed fundamental incompatibilities between the DeepSeek-V3 model and the Qwen tokenizer.

  ## Usage Instructions
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

- model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)

  inputs = tokenizer("The future of AI is", return_tensors="pt")
+ inputs.pop("token_type_ids", None)  # Critical fix: the Qwen tokenizer emits token_type_ids the model does not accept
  outputs = model.generate(**inputs, max_length=50)
  print(tokenizer.decode(outputs[0]))
  ```
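The usage snippet above can be packaged as a reusable call. The following is a minimal sketch, assuming the Robertp423/OpenSeek-Fixed repository id shown in the README and that its remote code loads cleanly; the generate_text helper is illustrative only and not part of the repository.

```python
# Minimal sketch of the README workaround; generate_text is a hypothetical helper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)

def generate_text(prompt: str, max_length: int = 50) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Drop token_type_ids: the Qwen tokenizer returns them, but the DeepSeek-V3
    # forward pass does not accept them (the "Critical fix" from the README).
    inputs.pop("token_type_ids", None)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0])

print(generate_text("The future of AI is"))
```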
config.json CHANGED
@@ -19,7 +19,6 @@
  "initializer_range": 0.006,
  "intermediate_size": 7168,
  "kv_lora_rank": 512,
- "position_embedding": "rope_64dim",
  "max_position_embeddings": 4096,
  "model_type": "deepseek_v3",
  "moe_intermediate_size": 896,
@@ -34,6 +33,8 @@
  "num_key_value_heads": 10,
  "num_nextn_predict_layers": 1,
  "pretraining_tp": 1,
+ "problematic_params": ["token_type_ids"],
+ "position_embedding_type": "rope",
  "q_lora_rank": null,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
special_tokens_map.json CHANGED
@@ -1,6 +1,6 @@
  {
  "bos_token": "<|extra_203|>",
  "eos_token": "<|extra_204|>",
- "unk_token": "[UNK]",
- "pad_token": "[PAD]"
+ "unk_token": "<|endoftext|>",
+ "pad_token": "<|endoftext|>"
  }
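With the BERT-style [UNK]/[PAD] entries replaced by <|endoftext|>, the unknown and padding tokens now point at a special token the Qwen tokenizer actually defines. A minimal check, assuming the Robertp423/OpenSeek-Fixed repository id from the README:

```python
# Sketch: verify the special-token remapping after this commit (assumed repo id).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)

print(tokenizer.bos_token)  # <|extra_203|>
print(tokenizer.eos_token)  # <|extra_204|>
print(tokenizer.unk_token)  # <|endoftext|> (was [UNK])
print(tokenizer.pad_token)  # <|endoftext|> (was [PAD])
```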
tokenizer_config.json CHANGED
@@ -1,11 +1,25 @@
  {
- "model_max_length": 8192,
+ "model_max_length": 4096,
  "tokenizer_class": "QWenTokenizer",
  "auto_map": {
  "AutoTokenizer": [
  "tokenization_qwen.QWenTokenizer",
  null
- ]
+ ]
  },
- "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
- }
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "added_tokens_map": {
+ "<|endoftext|>": {
+ "content": "<|endoftext|>",
+ "single_word": false
+ },
+ "<|im_start|>": {
+ "content": "<|im_start|>",
+ "single_word": false
+ },
+ "<|im_end|>": {
+ "content": "<|im_end|>",
+ "single_word": false
+ }
+ }
+ }
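The chat_template kept here follows the ChatML-style <|im_start|>/<|im_end|> format, and model_max_length is now aligned with the model's max_position_embeddings of 4096. A minimal sketch of rendering a prompt with it, assuming the Robertp423/OpenSeek-Fixed repository id from the README and a transformers release recent enough to expose apply_chat_template:

```python
# Sketch: render a ChatML-style prompt with the tokenizer's chat_template (assumed repo id).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Robertp423/OpenSeek-Fixed", trust_remote_code=True)
print(tokenizer.model_max_length)  # 4096 after this commit

messages = [{"role": "user", "content": "Summarize the OpenSeek project in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected shape of the rendered prompt, per the template above:
# <|im_start|>system
# You are a helpful assistant<|im_end|>
# <|im_start|>user
# Summarize the OpenSeek project in one sentence.<|im_end|>
# <|im_start|>assistant
```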