# Enhanced Hybrid Transformer (768D) - Untrained
**This is an untrained model architecture, ready for training.**

A modern language model architecture combining innovations from LLaMA and Qwen. This repository currently contains the model structure and will be updated with trained weights.
## Model Architecture
- Parameters: 142,425,600
- Architecture: Hybrid Transformer (LLaMA + Qwen inspired)
- Layers: 12
- Hidden Size: 768
- Attention: Grouped Query Attention (GQA-4)
- Feed Forward: SwiGLU activation
- Normalization: RMSNorm
- Position Encoding: RoPE (Rotary Position Embedding)
- Max Sequence Length: 2048
- Vocabulary: 50257 tokens (GPT-2 tokenizer)
## Key Features
- **Memory Efficient**: GQA-4 reduces KV cache memory by 66.7% versus full multi-head attention (a back-of-the-envelope check follows this list)
- **Modern Architecture**: SwiGLU, RMSNorm, RoPE
- **Training Ready**: Optimized for training stability
- **Compatible**: Standard `transformers` interface
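The 66.7% figure follows from the head counts: the KV cache scales with the number of key/value heads, so going from 12 heads to 4 keeps one third of the cache. The snippet below is illustrative only and assumes fp16 storage (2 bytes per value).

```python
# Back-of-the-envelope KV-cache comparison using this model's published dimensions.
# Assumes fp16 (2 bytes per cached value); numbers are illustrative only.
head_dim = 64
num_layers = 12
seq_len = 2048
bytes_per_value = 2  # fp16

def kv_cache_bytes(num_kv_heads: int) -> int:
    # 2x for keys and values, per layer, per cached position
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

mha_bytes = kv_cache_bytes(12)  # full multi-head attention (12 KV heads)
gqa_bytes = kv_cache_bytes(4)   # GQA-4 (4 KV heads)
print(f"MHA KV cache:   {mha_bytes / 1e6:.1f} MB per 2048-token sequence")
print(f"GQA-4 KV cache: {gqa_bytes / 1e6:.1f} MB per 2048-token sequence")
print(f"Reduction: {1 - gqa_bytes / mha_bytes:.1%}")  # -> 66.7%
```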
## Training Status

- **Currently**: Untrained (random weights)
- **Training**: Will be trained on Google Colab
- **Dataset**: WikiText-2 and custom datasets
- **Updates**: Weights will be updated automatically during training
## Usage (After Training)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shivash/enhanced-hybrid-transformer-768d")
model = AutoModelForCausalLM.from_pretrained("shivash/enhanced-hybrid-transformer-768d")

# Generate text (do_sample=True so the temperature setting takes effect)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Configuration
```python
# Recommended training settings
training_config = {
    "learning_rate": 5e-5,
    "batch_size": 8,
    "gradient_accumulation_steps": 4,  # Effective batch size: 32
    "num_epochs": 3,
    "warmup_ratio": 0.1,
    "weight_decay": 1e-2,
    "max_grad_norm": 1.0,
    "use_amp": True,
}
```
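The dict above is framework-agnostic. One possible mapping onto Hugging Face `TrainingArguments` is sketched below, reusing the `training_config` dict defined above; `output_dir` is a placeholder, and the `fp16` flag is assumed to correspond to `use_amp`.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./enhanced-hybrid-transformer-768d",  # placeholder path
    learning_rate=training_config["learning_rate"],
    per_device_train_batch_size=training_config["batch_size"],
    gradient_accumulation_steps=training_config["gradient_accumulation_steps"],
    num_train_epochs=training_config["num_epochs"],
    warmup_ratio=training_config["warmup_ratio"],
    weight_decay=training_config["weight_decay"],
    max_grad_norm=training_config["max_grad_norm"],
    fp16=training_config["use_amp"],  # mixed-precision training, assumed to match "use_amp"
)
```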
## Architecture Details
### Attention Mechanism

- Type: Grouped Query Attention (GQA); a minimal sketch follows this list
- Query Heads: 12
- Key/Value Heads: 4
- Head Dimension: 64
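For readers unfamiliar with GQA, the sketch below shows how 12 query heads can share 4 key/value heads by repeating K and V across query-head groups. It is a minimal, self-contained PyTorch illustration with this model's dimensions, not the module used in this repository.

```python
# Minimal sketch of grouped query attention (GQA-4); illustrative, names are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_size=768, num_q_heads=12, num_kv_heads=4, head_dim=64):
        super().__init__()
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        self.q_proj = nn.Linear(hidden_size, num_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(num_q_heads * head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of 3 query heads shares one key/value head (12 / 4 = 3).
        k = k.repeat_interleave(self.num_q_heads // self.num_kv_heads, dim=1)
        v = v.repeat_interleave(self.num_q_heads // self.num_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
```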
### Feed Forward Network

- Activation: SwiGLU (a minimal sketch follows this list)
- Hidden Dimension: 3072
- Gating: Yes (SwiGLU includes gating mechanism)
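A SwiGLU block projects the input into the 3072-dimensional hidden space twice (a gate path and an up path), applies SiLU to the gate, multiplies the two, and projects back down. The sketch below is illustrative and not the repository's actual layer.

```python
# Minimal sketch of a SwiGLU feed-forward block with this model's dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated linear unit, as in LLaMA-style FFNs
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```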
### Position Encoding

- Type: RoPE (Rotary Position Embedding); a minimal sketch follows this list
- Base: 10000.0
- Scaling: 1.0
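RoPE encodes positions by rotating pairs of query/key dimensions through position-dependent angles derived from the base frequency (10000.0 here). The helpers below are a minimal, interleaved-pair illustration; the repository's implementation may lay out the rotation differently.

```python
# Minimal sketch of rotary position embeddings (RoPE) with base 10000.0; illustrative only.
import torch

def rope_angles(seq_len, head_dim=64, base=10000.0):
    # One frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate interleaved pairs of dimensions
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```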
## Model Size Comparison
| Component | Parameters | Percentage |
|---|---|---|
| Embeddings | ~38.6M | ~27.1% |
| Transformer Layers | ~103.8M | ~72.9% |
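These numbers can be reproduced from the hyperparameters above under a few assumptions (tied input/output embeddings, bias-free projections, two RMSNorms per layer plus one final RMSNorm); the check below lands exactly on the stated 142,425,600 parameters.

```python
# Rough parameter-count check from the published hyperparameters.
# Assumes tied input/output embeddings, bias-free linear projections, and
# two RMSNorms per layer plus one final RMSNorm; actual bookkeeping may differ.
vocab_size, hidden, layers = 50257, 768, 12
q_heads, kv_heads, head_dim, ffn_hidden = 12, 4, 64, 3072

embeddings = vocab_size * hidden                      # 38,597,376
attention = (hidden * q_heads * head_dim              # Q projection
             + 2 * hidden * kv_heads * head_dim       # K and V projections
             + q_heads * head_dim * hidden)           # output projection
ffn = 3 * hidden * ffn_hidden                         # gate, up, down projections
norms = 2 * hidden                                    # pre-attention + pre-FFN RMSNorm
per_layer = attention + ffn + norms

total = embeddings + layers * per_layer + hidden      # + final RMSNorm
print(f"Embeddings:         {embeddings:,} ({embeddings / total:.1%})")
print(f"Transformer layers: {layers * per_layer:,} ({layers * per_layer / total:.1%})")
print(f"Total:              {total:,}")               # 142,425,600
```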
## Updates
- Initial Upload: Model architecture and configuration
- Training Progress: Will be updated during training on Google Colab
- Final Weights: Trained weights will replace random initialization
Generated with Claude Code
This model is part of the Enhanced Hybrid Transformer series exploring efficient architectures for language modeling.