JingzeShi committed on
Commit
0e189da
verified
1 Parent(s): f0c53fb

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -25,9 +25,9 @@ tags:
   <a href="https://discord.gg/P2yYH95N" target="_blank" style="margin: 2px;">
     <img alt="Discord" src="https://img.shields.io/badge/Discord-Small%20Doges-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
   </a>
-  <a href="https://arxiv.org/abs/2412.11834" target="_blank" style="margin: 2px;">
+  <!-- <a href="https://arxiv.org/abs/2412.11834" target="_blank" style="margin: 2px;">
     <img alt="arXiv" src="https://img.shields.io/static/v1?label=arXiv&message=2412.11834&color=B31B1B&logo=arXiv" style="display: inline-block; vertical-align: middle;"/>
-  </a>
+  </a> -->
   <a href="https://github.com/SmallDoges/small-doge" target="_blank" style="margin: 2px;">
     <img alt="GitHub" src="https://img.shields.io/badge/GitHub-SmallDoge-181717?logo=github" style="display: inline-block; vertical-align: middle;"/>
   </a>
@@ -36,7 +36,7 @@ tags:
   </a>
 </div>

- Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by [SmallDoge](https://huggingface.co/SmallDoge) community, for detailed algorithm and model architecture, please refer to [Wonderful Matrices](https://arxiv.org/abs/2412.11834), all training details and code are publicly available on the [small-doge](https://github.com/SmallDoges/small-doge) repository.
+ Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by [SmallDoge](https://huggingface.co/SmallDoge) community, for detailed algorithm and model architecture, paper coming soon, all training details and code are available in the [small-doge](https://github.com/SmallDoges/small-doge) repository.


 ## Uses
@@ -83,13 +83,13 @@ outputs = model.generate(

 We build the Doge-Instruct-SFT by SFT on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk).

- > TODO: The larger model is under training and will be uploaded soon.
-
 **SFT**:
 | Model | Training Data | Epochs | Content Length | LR | Batch Size | Precision |
 |---|---|---|---|---|---|---|
- | [Doge-20M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-20M-Instruct-SFT) | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
- | [Doge-60M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-60M-Instruct-SFT) | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |
+ | [Doge-20M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-20M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
+ | [Doge-60M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-60M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |
+ | [Doge-160M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-160M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 4e-4 | 0.25M | bfloat16 |
+ | [Doge-320M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-320M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 2e-4 | 0.25M | bfloat16 |


 **Procedure**:
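
For context, the last hunk edits the README section that follows `outputs = model.generate(`, i.e. the usage example for the SFT checkpoints listed in the table. A minimal sketch of how one of those checkpoints might be loaded and queried with `transformers` is shown below; the use of `trust_remote_code=True` and of the tokenizer's chat template are assumptions about the repo's custom modeling code, not something this commit specifies, and the generation settings are purely illustrative.

```python
# Minimal sketch: load one SFT checkpoint from the table above and run a single chat turn.
# Assumptions (not confirmed by this commit): the repo ships custom modeling code that is
# pulled in via trust_remote_code=True, and the tokenizer defines a chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SmallDoge/Doge-20M-Instruct-SFT"  # any row of the SFT table should work the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

conversation = [{"role": "user", "content": "Hi, how are you doing today?"}]
inputs = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,  # append the assistant prefix so the model answers
    return_tensors="pt",
)

outputs = model.generate(inputs, max_new_tokens=100)
# Strip the prompt tokens and print only the newly generated reply.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```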