JingzeShi committed on
Commit 1e853ed (verified)
1 Parent(s): 0c49a23

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -36,7 +36,7 @@ tags:
   </a>
 </div>
 
-Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by [SmallDoge](https://huggingface.co/SmallDoge) community, for detailed algorithm and model architecture, please refer to [Wonderful Matrices](https://arxiv.org/abs/2412.11834), all training details and code are publicly available on the [small-doge](https://github.com/SmallDoges/small-doge) repository.
+Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by the [SmallDoge](https://huggingface.co/SmallDoge) community; a paper detailing the algorithm and model architecture is coming soon, and all training details and code are available in the [small-doge](https://github.com/SmallDoges/small-doge) repository.
 
 
 ## Uses
@@ -81,7 +81,7 @@ outputs = model.generate(
 
 ## Model Details
 
-We build the Doge-Instruct by first SFT on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and then DPO on [UltraFeedback Binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).
+We build Doge-Instruct-SFT by SFT on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk).
 
 **SFT**:
 | Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
@@ -91,6 +91,7 @@ We build the Doge-Instruct by first SFT on [SmolTalk](https://huggingface.co/dat
 | [Doge-160M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-160M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 4e-4 | 0.25M | bfloat16 |
 | [Doge-320M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-320M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 2e-4 | 0.25M | bfloat16 |
 
+
 **Procedure**:
 
 **SFT**:
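For orientation, the `## Uses` hunk anchors on `outputs = model.generate(`, i.e. the README's usage section generates text from one of the checkpoints listed in the SFT table. A minimal inference sketch along those lines, assuming `transformers` with `trust_remote_code=True` (Doge ships custom modeling code on the Hub) and a chat template bundled with the checkpoint; the prompt and generation arguments are illustrative and not taken from this commit:

```python
# Hedged sketch: load one of the checkpoints from the SFT table and generate a reply.
# trust_remote_code=True, the chat-template call, and the generation arguments are
# assumptions, not part of this commit.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SmallDoge/Doge-160M-Instruct-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Build a chat-formatted prompt and generate.
conversation = [{"role": "user", "content": "Hi, how are you doing today?"}]
inputs = tokenizer.apply_chat_template(
    conversation, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```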
 
36
  </a>
37
  </div>
38
 
39
+ Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by [SmallDoge](https://huggingface.co/SmallDoge) community, for detailed algorithm and model architecture, paper coming soon, all training details and code are available in the [small-doge](https://github.com/SmallDoges/small-doge) repository.
40
 
41
 
42
  ## Uses
 
81
 
82
  ## Model Details
83
 
84
+ We build the Doge-Instruct-SFT by SFT on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk).
85
 
86
  **SFT**:
87
  | Model | Training Data | Epochs | Content Length | LR | Batch Size | Precision |
 
91
  | [Doge-160M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-160M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 4e-4 | 0.25M | bfloat16 |
92
  | [Doge-320M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-320M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 2e-4 | 0.25M | bfloat16 |
93
 
94
+
95
  **Procedure**:
96
 
97
  **SFT**:
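The SFT table fixes the epochs, context length, learning rate, token batch size, and precision for each checkpoint. A sketch of how the Doge-160M-Instruct-SFT row could map onto `trl`'s `SFTTrainer` follows; the base checkpoint, the `SFTConfig` field names (which vary across `trl` versions), and the decomposition of the ~0.25M-token batch into per-device batch size and gradient accumulation are assumptions, and the authoritative scripts live in the small-doge repository:

```python
# Illustrative only: maps the Doge-160M-Instruct-SFT row of the SFT table
# (smoltalk, 2 epochs, 2048 context length, 4e-4 LR, bfloat16) onto trl.
# Field names, the base checkpoint, and the batch decomposition are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

config = SFTConfig(
    output_dir="Doge-160M-Instruct-SFT",
    num_train_epochs=2,              # "Epochs" column
    max_seq_length=2048,             # "Context Length" column
    learning_rate=4e-4,              # "LR" column
    per_device_train_batch_size=4,   # 4 * 32 * 2048 tokens ~= 0.25M tokens per step
    gradient_accumulation_steps=32,
    bf16=True,                       # "Precision" column
)

trainer = SFTTrainer(
    model="SmallDoge/Doge-160M",     # base checkpoint assumed as the SFT starting point
    args=config,
    train_dataset=dataset,
)
trainer.train()
```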