# Qwen3-32B-to-72B-Stage1

## Model Description

This is an intermediate checkpoint in the process of expanding Qwen3-32B to match the Qwen3-72B architecture dimensions. This model represents Stage 1 of a two-stage upscaling process: the hidden dimensions and attention heads have been expanded, but the model still has 64 layers.

**⚠️ Note: This is an intermediate checkpoint not intended for direct use. For the complete model, use Qwen3-72B-Embiggened.**

## Architecture Changes

### Original Qwen3-32B
- Hidden size: 5,120
- Intermediate size: 25,600
- Attention heads: 64 Q heads (despite the 5,120 hidden size; see Key Insights below)
- KV heads: 8
- Layers: 64

### Stage 1 Output (This Model)
- Hidden size: 8,192 ✓
- Intermediate size: 29,568 ✓
- Attention heads: 64 ✓
- KV heads: 8 ✓
- Layers: 64 (unchanged)

## Methodology

This model was created using structure-aware linear interpolation with the following techniques (an illustrative sketch follows the list):

1. **Layer-Dependent Interpolation Weights**
   - Early layers (0-25%): Conservative interpolation (weight=0.3)
   - Middle layers (25-75%): Balanced interpolation (weight=0.5)
   - Late layers (75-100%): Aggressive interpolation (weight=0.7)

2. **Structured Noise Addition**
   - Small amounts of structured noise (0.5%) added to break symmetry
   - Reduced noise in central components to preserve important features

3. **Norm Preservation**
   - Original tensor norms preserved during interpolation
   - Critical for maintaining stable activations

4. **Component-Specific Handling**
   - Embeddings: Conservative interpolation (0.3)
   - Attention projections: Proper handling of the GQA architecture
   - MLP layers: More aggressive interpolation with layer-dependent weights
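
The expansion code itself is not shipped with this checkpoint, so the following is only a minimal, hypothetical sketch of how a single 2-D weight could be expanded along the lines described above. The specific blend between a zero-padded copy and a bilinearly resized copy, the exact layer-fraction thresholds, and the noise/norm handling are assumptions for illustration, not the recipe actually used.

```python
import torch
import torch.nn.functional as F

def expand_tensor(w: torch.Tensor, target_shape: tuple[int, int],
                  layer_frac: float, noise_scale: float = 0.005) -> torch.Tensor:
    """Expand a 2-D weight to `target_shape` while preserving its norm.

    `layer_frac` is the layer's relative position in the stack (layer_idx / num_layers).
    This is an illustrative sketch, not the checkpoint's actual expansion code.
    """
    # 1. Layer-dependent blend weight: 0.3 early, 0.5 middle, 0.7 late.
    if layer_frac < 0.25:
        alpha = 0.3
    elif layer_frac < 0.75:
        alpha = 0.5
    else:
        alpha = 0.7

    # 2. Two candidate expansions (assumption): the original weights zero-padded
    #    into the top-left corner, and a bilinear resize of the whole matrix.
    padded = torch.zeros(target_shape, dtype=torch.float32)
    padded[: w.shape[0], : w.shape[1]] = w.float()
    resized = F.interpolate(
        w[None, None].float(), size=target_shape,
        mode="bilinear", align_corners=False,
    )[0, 0]

    # 3. Blend, then add a small amount (0.5%) of noise to break symmetry.
    out = (1 - alpha) * padded + alpha * resized
    out = out + noise_scale * out.std() * torch.randn_like(out)

    # 4. Rescale so the Frobenius norm matches the original tensor.
    out = out * (w.float().norm() / out.norm())
    return out.to(w.dtype)

# Example: a q_proj weight from layer 40 of 64 ([8192, 5120] -> [8192, 8192]).
w_new = expand_tensor(torch.randn(8192, 5120), (8192, 8192), layer_frac=40 / 64)
print(w_new.shape)  # torch.Size([8192, 8192])
```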

## Technical Details

### Dimension Transformations

```
lm_head:      [151936, 5120] → [151936, 8192]
embed_tokens: [151936, 5120] → [151936, 8192]
q_proj:       [8192, 5120]   → [8192, 8192]
k_proj:       [1024, 5120]   → [1024, 8192]
v_proj:       [1024, 5120]   → [1024, 8192]
o_proj:       [5120, 8192]   → [8192, 8192]
gate_proj:    [25600, 5120]  → [29568, 8192]
up_proj:      [25600, 5120]  → [29568, 8192]
down_proj:    [5120, 25600]  → [8192, 29568]
```
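
These shapes can be checked without materializing any weights by instantiating the model on PyTorch's meta device. The snippet below is a sketch; the repository id is a placeholder for wherever this checkpoint is hosted.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder repo id; substitute the actual location of this checkpoint.
repo_id = "Qwen3-32B-to-72B-Stage1"

config = AutoConfig.from_pretrained(repo_id)
with torch.device("meta"):  # allocate no real memory, only shapes
    model = AutoModelForCausalLM.from_config(config)

for name, param in model.named_parameters():
    if "layers.0." in name or "embed_tokens" in name or "lm_head" in name:
        print(f"{name}: {tuple(param.shape)}")
```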

### Key Insights

- Qwen3-32B already uses 64 Q heads despite its 5,120 hidden size (64 heads × 128 head dim = 8,192 rows in q_proj)
- Group Query Attention (GQA) is maintained with 8 KV heads (8 × 128 = 1,024 rows in k_proj/v_proj)
- All interpolations preserve the norms of the original weights

## Usage

This is an intermediate checkpoint. To use the complete 72B model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the complete model instead
model = AutoModelForCausalLM.from_pretrained(
    "Qwen3-72B-Embiggened",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Qwen3-72B-Embiggened")
```

## Hardware Requirements

- Minimum VRAM: ~130GB (for the full model in bf16)
- Recommended: multiple GPUs with at least 160GB total VRAM
- Tested on: 8x AMD MI300X GPUs
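
For multi-GPU setups, per-device memory caps can be made explicit via the standard `max_memory` argument to `from_pretrained`. This is only a sketch: the 20GiB-per-GPU cap (160GB total, matching the recommendation above) and the CPU offload headroom are illustrative values, not tested settings.

```python
import torch
from transformers import AutoModelForCausalLM

# Cap each of 8 GPUs at 20GiB (160GB total); adjust to your hardware.
max_memory = {i: "20GiB" for i in range(8)}
max_memory["cpu"] = "200GiB"  # optional CPU offload headroom (assumption)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen3-72B-Embiggened",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
)
```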

## Limitations

1. This is an intermediate checkpoint - the layer count does not yet match Qwen3-72B
2. Not fine-tuned or aligned - raw interpolated weights only
3. May exhibit some instabilities due to interpolation artifacts
4. Performance characteristics are undefined without further training

## Next Steps

To complete the expansion to the Qwen3-72B architecture:
1. Use Stage 2 processing to expand from 64 to 80 layers (see the sketch after this list)
2. Consider fine-tuning on high-quality datasets
3. Apply alignment techniques if needed for specific use cases
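
Stage 2 is not part of this repository. As a rough, hypothetical illustration of one common depth-upscaling approach, the sketch below grows a 64-layer model to 80 layers by duplicating every 4th decoder layer in place; the actual Stage 2 recipe for Qwen3-72B-Embiggened may differ.

```python
import copy
import torch.nn as nn

def expand_depth(model, duplicate_every: int = 4):
    """Duplicate every `duplicate_every`-th decoder layer: 64 + 64/4 = 80 layers."""
    new_layers = []
    for i, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if (i + 1) % duplicate_every == 0:
            new_layers.append(copy.deepcopy(layer))

    model.model.layers = nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)

    # Keep attention layer indices consistent for KV caching
    # (Qwen-style decoder layers store `layer_idx` on their attention module).
    for idx, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = idx
    return model
```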

## Citation

If you use this work, please cite:

```bibtex
@misc{qwen3-embiggening-2025,
  title={Qwen3 32B to 72B Architecture Expansion via Structure-Aware Interpolation},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://github.com/yourusername/qwen3-embiggening}}
}
```

## License

This model inherits the license from the original Qwen3-32B model. Please refer to the original model card for licensing information.

## Acknowledgments

- Original Qwen3-32B model by Alibaba Cloud
- Interpolation techniques inspired by model merging research
- "Embiggened" - A perfectly cromulent word