# Qwen3-32B-to-72B-Stage1

## Model Description

This is an intermediate checkpoint in the process of expanding Qwen3-32B to the Qwen3-72B architecture dimensions. It represents Stage 1 of a two-stage upscaling process: the hidden and intermediate dimensions have been expanded, but the model still has 64 layers.

**⚠️ Note: This is an intermediate checkpoint and is not intended for direct use. For the complete model, use Qwen3-72B-Embiggened.**

## Architecture Changes

### Original Qwen3-32B
- Hidden size: 5,120
- Intermediate size: 25,600
- Attention heads: 64 Q heads (more than the 40 implied by hidden_size / head_dim; see Key Insights)
- KV heads: 8
- Layers: 64

### Stage 1 Output (This Model)
- Hidden size: 8,192 ✅
- Intermediate size: 29,568 ✅
- Attention heads: 64 ✅
- KV heads: 8 ✅
- Layers: 64 (unchanged)

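In config terms, Stage 1 amounts to growing two fields while leaving the attention layout and depth untouched. The snippet below is illustrative only: field names follow the Hugging Face Qwen3 configuration, and the starting repo id `Qwen/Qwen3-32B` is assumed.

```python
from transformers import AutoConfig

# Illustrative sketch of the Stage 1 config delta (not the actual expansion script).
config = AutoConfig.from_pretrained("Qwen/Qwen3-32B")
config.hidden_size = 8192          # 5,120  -> 8,192
config.intermediate_size = 29568   # 25,600 -> 29,568
# num_attention_heads (64), num_key_value_heads (8), and
# num_hidden_layers (64) stay the same at this stage.
```
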
## Methodology

This model was created using structure-aware linear interpolation, combining the following techniques (a minimal sketch of the core helper follows this list):

1. **Layer-Dependent Interpolation Weights**
   - Early layers (0-25%): conservative interpolation (weight = 0.3)
   - Middle layers (25-75%): balanced interpolation (weight = 0.5)
   - Late layers (75-100%): aggressive interpolation (weight = 0.7)

2. **Structured Noise Addition**
   - A small amount of structured noise (0.5%) is added to break symmetry
   - Noise is reduced in central components to preserve important features

3. **Norm Preservation**
   - Original tensor norms are preserved during interpolation
   - Critical for maintaining stable activations

4. **Component-Specific Handling**
   - Embeddings: conservative interpolation (0.3)
   - Attention projections: expanded in a way that respects the GQA layout
   - MLP layers: more aggressive interpolation with layer-dependent weights

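The sketch below shows how these pieces could fit together. It is an assumption-laden illustration, not the exact expansion code: it assumes each grown dimension is produced by blending a nearest-neighbor copy of the original slices with a linear resize, adding roughly 0.5% noise, and rescaling to the original tensor norm. The helper names `layer_weight` and `expand_dim` are hypothetical.

```python
import torch

def layer_weight(layer_idx: int, num_layers: int) -> float:
    """Interpolation weight for a layer, following the schedule above."""
    frac = layer_idx / max(num_layers - 1, 1)
    if frac < 0.25:
        return 0.3   # early layers: conservative
    if frac < 0.75:
        return 0.5   # middle layers: balanced
    return 0.7       # late layers: aggressive

def expand_dim(w: torch.Tensor, new_size: int, dim: int,
               weight: float, noise_scale: float = 0.005) -> torch.Tensor:
    """Grow one dimension of `w` to `new_size`: blend a nearest-neighbor copy
    of the original slices with a linear resize, add ~0.5% noise to break
    symmetry, and rescale so the original tensor norm is preserved."""
    old_size = w.shape[dim]
    x = w.movedim(dim, -1).float()

    # Map each new index back onto the original axis.
    pos = torch.linspace(0, old_size - 1, new_size)
    lo = pos.floor().long().clamp(0, old_size - 1)
    hi = pos.ceil().long().clamp(0, old_size - 1)
    frac = pos - lo.float()

    nearest = x[..., pos.round().long().clamp(0, old_size - 1)]
    linear = x[..., lo] * (1 - frac) + x[..., hi] * frac

    out = (1 - weight) * nearest + weight * linear
    out = out + noise_scale * out.std() * torch.randn_like(out)

    # Norm preservation: keep the Frobenius norm of the original tensor.
    out = out * (x.norm() / out.norm().clamp_min(1e-8))
    return out.movedim(-1, dim).to(w.dtype)
```

For example, a layer-`i` `q_proj` weight of shape `[8192, 5120]` would become `[8192, 8192]` via `expand_dim(w, 8192, dim=1, weight=layer_weight(i, 64))`, matching the shape table in the next section.
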
## Technical Details

### Dimension Transformations

```
lm_head:      [151936, 5120] → [151936, 8192]
embed_tokens: [151936, 5120] → [151936, 8192]
q_proj:       [8192, 5120]   → [8192, 8192]
k_proj:       [1024, 5120]   → [1024, 8192]
v_proj:       [1024, 5120]   → [1024, 8192]
o_proj:       [5120, 8192]   → [8192, 8192]
gate_proj:    [25600, 5120]  → [29568, 8192]
up_proj:      [25600, 5120]  → [29568, 8192]
down_proj:    [5120, 25600]  → [8192, 29568]
```
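
For concreteness, here is how this table could be applied per layer with the hypothetical `expand_dim` helper sketched in the Methodology section. It operates on a plain state dict; key names follow the Hugging Face Qwen3 layout, and `lw` is the layer's interpolation weight. This is an illustrative sketch, not the exact script used.

```python
def expand_layer(sd: dict, i: int, lw: float) -> None:
    """Expand the tensors of decoder layer `i` in place, per the shape table."""
    p = f"model.layers.{i}."

    # q/k/v_proj: only the input (hidden) dimension grows, 5120 -> 8192.
    for name in ("q_proj", "k_proj", "v_proj"):
        key = p + f"self_attn.{name}.weight"
        sd[key] = expand_dim(sd[key], 8192, dim=1, weight=lw)

    # o_proj: only the output dimension grows, [5120, 8192] -> [8192, 8192].
    key = p + "self_attn.o_proj.weight"
    sd[key] = expand_dim(sd[key], 8192, dim=0, weight=lw)

    # MLP: both dimensions grow.
    targets = {"gate_proj": (29568, 8192), "up_proj": (29568, 8192), "down_proj": (8192, 29568)}
    for name, (d0, d1) in targets.items():
        key = p + f"mlp.{name}.weight"
        w = expand_dim(sd[key], d0, dim=0, weight=lw)
        sd[key] = expand_dim(w, d1, dim=1, weight=lw)

# embed_tokens and lm_head grow along dim=1 in the same way (5120 -> 8192).
```
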
### Key Insights

- Qwen3-32B already uses asymmetric attention, with 64 Q heads despite its 5,120 hidden size
- Group Query Attention (GQA) is maintained with 8 KV heads
- All interpolations preserve the norms of the original weight tensors

## Usage

This is an intermediate checkpoint. To use the complete 72B model instead:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the complete model instead
tokenizer = AutoTokenizer.from_pretrained("Qwen3-72B-Embiggened", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen3-72B-Embiggened",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
```
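
A quick smoke test once the model is loaded, using standard `transformers` generation (the prompt is arbitrary):

```python
prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
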
## Hardware Requirements

- Minimum VRAM: ~130 GB (full model in bf16)
- Recommended: multiple GPUs with at least 160 GB total VRAM
- Tested on: 8x AMD MI300X GPUs

## Limitations

1. This is an intermediate checkpoint; the layer count (64) does not match Qwen3-72B (80)
2. Not fine-tuned or aligned; these are raw interpolated weights
3. May exhibit some instability due to interpolation artifacts
4. Performance characteristics are undefined without further training

## Next Steps

To complete the expansion to the Qwen3-72B architecture:
1. Use Stage 2 processing to expand from 64 to 80 layers
2. Consider fine-tuning on high-quality datasets
3. Apply alignment techniques if needed for specific use cases

## Citation

If you use this work, please cite:

```bibtex
@misc{qwen3-embiggening-2025,
  title={Qwen3 32B to 72B Architecture Expansion via Structure-Aware Interpolation},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://github.com/yourusername/qwen3-embiggening}}
}
```

## License

This model inherits the license from the original Qwen3-32B model. Please refer to the original model card for licensing information.

121
+
122
+ ## Acknowledgments
123
+
124
+ - Original Qwen3-32B model by Alibaba Cloud
125
+ - Interpolation techniques inspired by model merging research
126
+ - "Embiggened" - A perfectly cromulent word