Doctor-Shotgun committed
Commit 2d9e23e · verified · 1 parent: 6742e10

Create README.md

Files changed (1)
  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen3-Coder-30B-A3B-Instruct
+ library_name: transformers
+ ---
+
+ # Qwen3-Coder-30B-A3B-Instruct-ScatterMoE
+
+ Re-packed weights of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) using [Charles Goddard](https://huggingface.co/chargoddard)'s remote code implementation of [scattermoe](https://github.com/shawntan/scattermoe), including scripts to convert to and from standard `Qwen3MoeForCausalLM`. Thank you to [intervitens](https://huggingface.co/intervitens) for assistance with memory-efficient conversion scripts!
+
+ This is intended to be used as a drop-in replacement for efficient training using any `transformers`-based training repository.
+
+ Optional monkeypatches included for [Liger Kernel](https://github.com/linkedin/Liger-Kernel) and [Cut Cross-Entropy](https://github.com/apple/ml-cross-entropy). Simply rename the relevant modeling file to `modeling_qwen3_shared_moe.py`.
+
+ ## Citations
+
+ ```
+ @misc{qwen3technicalreport,
+ title={Qwen3 Technical Report},
+ author={Qwen Team},
+ year={2025},
+ eprint={2505.09388},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2505.09388},
+ }
+
+ @misc{tan2024scatteredmixtureofexpertsimplementation,
+ title={Scattered Mixture-of-Experts Implementation},
+ author={Shawn Tan and Yikang Shen and Rameswar Panda and Aaron Courville},
+ year={2024},
+ eprint={2403.08245},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2403.08245},
+ }
+
+ @misc{hsu2025ligerkernelefficienttriton,
+ title={Liger Kernel: Efficient Triton Kernels for LLM Training},
+ author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen},
+ year={2025},
+ eprint={2410.10989},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2410.10989},
+ }
+
+ @misc{wijmans2025cutlosseslargevocabularylanguage,
+ title={Cut Your Losses in Large-Vocabulary Language Models},
+ author={Erik Wijmans and Brody Huval and Alexander Hertzberg and Vladlen Koltun and Philipp Krähenbühl},
+ year={2025},
+ eprint={2411.09009},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2411.09009},
+ }
+ ```
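
The README's instruction to rename the relevant modeling file to `modeling_qwen3_shared_moe.py` can be sketched as a small helper. This is a minimal sketch and not part of the commit: the function name `activate_modeling_variant`, the example file name `modeling_variant_example.py`, and the choice to copy the file (keeping a `.bak` backup of the previous one) rather than rename it are assumptions made for illustration. Only the target file name comes from the README.

```python
import shutil
import tempfile
from pathlib import Path

# Target file name taken from the README; everything else here is hypothetical.
TARGET_NAME = "modeling_qwen3_shared_moe.py"


def activate_modeling_variant(repo_dir, variant_name, target_name=TARGET_NAME):
    """Copy a patched modeling file over the one the remote code loads."""
    repo = Path(repo_dir)
    variant = repo / variant_name
    target = repo / target_name
    if not variant.is_file():
        raise FileNotFoundError(f"no such modeling file: {variant}")
    if target.exists():
        # Keep the previous file next to the target so the swap can be undone.
        shutil.copyfile(target, target.with_name(target.name + ".bak"))
    shutil.copyfile(variant, target)
    return target


def demo():
    """Run the helper against a throwaway directory with placeholder files."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / TARGET_NAME).write_text("# default\n")
        # Hypothetical variant name; the README does not list the real ones.
        (repo / "modeling_variant_example.py").write_text("# patched\n")
        target = activate_modeling_variant(repo, "modeling_variant_example.py")
        backup = repo / (TARGET_NAME + ".bak")
        return target.read_text(), backup.read_text()
```

In a local checkout you would pass the actual name of the Liger Kernel or Cut Cross-Entropy modeling file, which the README does not list. Since the ScatterMoE implementation is described as remote code, loading the model through `transformers` afterwards presumably requires `trust_remote_code=True`.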