# Qwen3-32B-to-72B-Stage1

## Model Description

This is an intermediate checkpoint in the process of expanding Qwen3-32B to match the Qwen3-72B architecture dimensions. This model represents Stage 1 of a two-stage upscaling process: the hidden dimensions and attention heads have been expanded, but the model still has 64 layers.

**⚠️ Note: This is an intermediate checkpoint not intended for direct use. For the complete model, use Qwen3-72B-Embiggened.**

## Architecture Changes

### Original Qwen3-32B
- Hidden size: 5,120
- Intermediate size: 25,600
- Attention heads: 64 Q heads (despite the 5,120 hidden size; see Key Insights below)
- KV heads: 8
- Layers: 64

### Stage 1 Output (This Model)
- Hidden size: 8,192 ✓
- Intermediate size: 29,568 ✓
- Attention heads: 64 ✓
- KV heads: 8 ✓
- Layers: 64 (unchanged)

## Methodology

This model was created using structure-aware linear interpolation with the following techniques (an illustrative sketch follows the list):

1. **Layer-Dependent Interpolation Weights**
   - Early layers (0-25%): Conservative interpolation (weight=0.3)
   - Middle layers (25-75%): Balanced interpolation (weight=0.5)
   - Late layers (75-100%): Aggressive interpolation (weight=0.7)

2. **Structured Noise Addition**
   - Small amounts of structured noise (0.5%) added to break symmetry
   - Reduced noise in central components to preserve important features

3. **Norm Preservation**
   - Original tensor norms preserved during interpolation
   - Critical for maintaining stable activations

4. **Component-Specific Handling**
   - Embeddings: Conservative interpolation (0.3)
   - Attention projections: Proper handling of the GQA architecture
   - MLP layers: More aggressive interpolation with layer-dependent weights
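
The expansion code itself is not shipped with this checkpoint, so the following is only a minimal, hypothetical sketch of how a single 2-D weight could be expanded along the lines described above. The specific blend between a zero-padded copy and a bilinearly resized copy, the exact layer-fraction thresholds, and the noise/norm handling are assumptions for illustration, not the recipe actually used.

```python
import torch
import torch.nn.functional as F

def expand_tensor(w: torch.Tensor, target_shape: tuple[int, int],
                  layer_frac: float, noise_scale: float = 0.005) -> torch.Tensor:
    """Expand a 2-D weight to `target_shape` while preserving its norm.

    `layer_frac` is the layer's relative position in the stack (layer_idx / num_layers).
    This is an illustrative sketch, not the checkpoint's actual expansion code.
    """
    # 1. Layer-dependent blend weight: 0.3 early, 0.5 middle, 0.7 late.
    if layer_frac < 0.25:
        alpha = 0.3
    elif layer_frac < 0.75:
        alpha = 0.5
    else:
        alpha = 0.7

    # 2. Two candidate expansions (assumption): the original weights zero-padded
    #    into the top-left corner, and a bilinear resize of the whole matrix.
    padded = torch.zeros(target_shape, dtype=torch.float32)
    padded[: w.shape[0], : w.shape[1]] = w.float()
    resized = F.interpolate(
        w[None, None].float(), size=target_shape,
        mode="bilinear", align_corners=False,
    )[0, 0]

    # 3. Blend, then add a small amount (0.5%) of noise to break symmetry.
    out = (1 - alpha) * padded + alpha * resized
    out = out + noise_scale * out.std() * torch.randn_like(out)

    # 4. Rescale so the Frobenius norm matches the original tensor.
    out = out * (w.float().norm() / out.norm())
    return out.to(w.dtype)

# Example: a q_proj weight from layer 40 of 64 ([8192, 5120] -> [8192, 8192]).
w_new = expand_tensor(torch.randn(8192, 5120), (8192, 8192), layer_frac=40 / 64)
print(w_new.shape)  # torch.Size([8192, 8192])
```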

## Technical Details

### Dimension Transformations

```
lm_head:      [151936, 5120] → [151936, 8192]
embed_tokens: [151936, 5120] → [151936, 8192]
q_proj:       [8192, 5120]   → [8192, 8192]
k_proj:       [1024, 5120]   → [1024, 8192]
v_proj:       [1024, 5120]   → [1024, 8192]
o_proj:       [5120, 8192]   → [8192, 8192]
gate_proj:    [25600, 5120]  → [29568, 8192]
up_proj:      [25600, 5120]  → [29568, 8192]
down_proj:    [5120, 25600]  → [8192, 29568]
```
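
These shapes can be checked without materializing any weights by instantiating the model on PyTorch's meta device. The snippet below is a sketch; the repository id is a placeholder for wherever this checkpoint is hosted.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder repo id; substitute the actual location of this checkpoint.
repo_id = "Qwen3-32B-to-72B-Stage1"

config = AutoConfig.from_pretrained(repo_id)
with torch.device("meta"):  # allocate no real memory, only shapes
    model = AutoModelForCausalLM.from_config(config)

for name, param in model.named_parameters():
    if "layers.0." in name or "embed_tokens" in name or "lm_head" in name:
        print(f"{name}: {tuple(param.shape)}")
```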

### Key Insights

- Qwen3-32B already uses 64 Q heads despite its 5,120 hidden size (64 heads × 128 head dim = 8,192 rows in q_proj)
- Group Query Attention (GQA) is maintained with 8 KV heads (8 × 128 = 1,024 rows in k_proj/v_proj)
- All interpolations preserve the norms of the original weights

## Usage

This is an intermediate checkpoint. To use the complete 72B model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the complete model instead
model = AutoModelForCausalLM.from_pretrained(
    "Qwen3-72B-Embiggened",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Qwen3-72B-Embiggened")
```

## Hardware Requirements

- Minimum VRAM: ~130GB (for the full model in bf16)
- Recommended: multiple GPUs with at least 160GB total VRAM
- Tested on: 8x AMD MI300X GPUs
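
For multi-GPU setups, per-device memory caps can be made explicit via the standard `max_memory` argument to `from_pretrained`. This is only a sketch: the 20GiB-per-GPU cap (160GB total, matching the recommendation above) and the CPU offload headroom are illustrative values, not tested settings.

```python
import torch
from transformers import AutoModelForCausalLM

# Cap each of 8 GPUs at 20GiB (160GB total); adjust to your hardware.
max_memory = {i: "20GiB" for i in range(8)}
max_memory["cpu"] = "200GiB"  # optional CPU offload headroom (assumption)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen3-72B-Embiggened",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
)
```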

## Limitations

1. This is an intermediate checkpoint - the layer count does not yet match Qwen3-72B
2. Not fine-tuned or aligned - raw interpolated weights only
3. May exhibit some instabilities due to interpolation artifacts
4. Performance characteristics are undefined without further training

## Next Steps

To complete the expansion to the Qwen3-72B architecture:
1. Use Stage 2 processing to expand from 64 to 80 layers (see the sketch after this list)
2. Consider fine-tuning on high-quality datasets
3. Apply alignment techniques if needed for specific use cases
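
Stage 2 is not part of this repository. As a rough, hypothetical illustration of one common depth-upscaling approach, the sketch below grows a 64-layer model to 80 layers by duplicating every 4th decoder layer in place; the actual Stage 2 recipe for Qwen3-72B-Embiggened may differ.

```python
import copy
import torch.nn as nn

def expand_depth(model, duplicate_every: int = 4):
    """Duplicate every `duplicate_every`-th decoder layer: 64 + 64/4 = 80 layers."""
    new_layers = []
    for i, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if (i + 1) % duplicate_every == 0:
            new_layers.append(copy.deepcopy(layer))

    model.model.layers = nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)

    # Keep attention layer indices consistent for KV caching
    # (Qwen-style decoder layers store `layer_idx` on their attention module).
    for idx, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = idx
    return model
```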

## Citation

If you use this work, please cite:

```bibtex
@misc{qwen3-embiggening-2025,
  title={Qwen3 32B to 72B Architecture Expansion via Structure-Aware Interpolation},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://github.com/yourusername/qwen3-embiggening}}
}
```

## License

This model inherits the license from the original Qwen3-32B model. Please refer to the original model card for licensing information.

## Acknowledgments

- Original Qwen3-32B model by Alibaba Cloud
- Interpolation techniques inspired by model merging research
- "Embiggened" - A perfectly cromulent word