# Enhanced Hybrid Transformer (768D) - Untrained
**This is an untrained model architecture, ready for training.**

A modern language model architecture combining innovations from LLaMA and Qwen. This repository currently contains the model structure and will be updated with trained weights.
## Model Architecture
- Parameters: 142,425,600
- Architecture: Hybrid Transformer (LLaMA + Qwen inspired)
- Layers: 12
- Hidden Size: 768
- Attention: Grouped Query Attention (GQA-4)
- Feed Forward: SwiGLU activation
- Normalization: RMSNorm
- Position Encoding: RoPE (Rotary Position Embedding)
- Max Sequence Length: 2048
- Vocabulary: 50257 tokens (GPT-2 tokenizer)
## Key Features
- **Memory Efficient**: GQA-4 reduces KV cache memory by 66.7% versus full multi-head attention (a back-of-the-envelope check follows this list)
- **Modern Architecture**: SwiGLU, RMSNorm, RoPE
- **Training Ready**: Optimized for training stability
- **Compatible**: Standard `transformers` interface
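The 66.7% figure follows from the head counts: the KV cache scales with the number of key/value heads, so going from 12 heads to 4 keeps one third of the cache. The snippet below is illustrative only and assumes fp16 storage (2 bytes per value).

```python
# Back-of-the-envelope KV-cache comparison using this model's published dimensions.
# Assumes fp16 (2 bytes per cached value); numbers are illustrative only.
head_dim = 64
num_layers = 12
seq_len = 2048
bytes_per_value = 2  # fp16

def kv_cache_bytes(num_kv_heads: int) -> int:
    # 2x for keys and values, per layer, per cached position
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

mha_bytes = kv_cache_bytes(12)  # full multi-head attention (12 KV heads)
gqa_bytes = kv_cache_bytes(4)   # GQA-4 (4 KV heads)
print(f"MHA KV cache:   {mha_bytes / 1e6:.1f} MB per 2048-token sequence")
print(f"GQA-4 KV cache: {gqa_bytes / 1e6:.1f} MB per 2048-token sequence")
print(f"Reduction: {1 - gqa_bytes / mha_bytes:.1%}")  # -> 66.7%
```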
## Training Status

- **Currently**: Untrained (random weights)
- **Training**: Will be trained on Google Colab
- **Dataset**: WikiText-2 and custom datasets
- **Updates**: Weights will be updated automatically during training
## Usage (After Training)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shivash/enhanced-hybrid-transformer-768d")
model = AutoModelForCausalLM.from_pretrained("shivash/enhanced-hybrid-transformer-768d")

# Generate text (do_sample=True so the temperature setting takes effect)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Configuration
```python
# Recommended training settings
training_config = {
    "learning_rate": 5e-5,
    "batch_size": 8,
    "gradient_accumulation_steps": 4,  # Effective batch size: 32
    "num_epochs": 3,
    "warmup_ratio": 0.1,
    "weight_decay": 1e-2,
    "max_grad_norm": 1.0,
    "use_amp": True,
}
```
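The dict above is framework-agnostic. One possible mapping onto Hugging Face `TrainingArguments` is sketched below, reusing the `training_config` dict defined above; `output_dir` is a placeholder, and the `fp16` flag is assumed to correspond to `use_amp`.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./enhanced-hybrid-transformer-768d",  # placeholder path
    learning_rate=training_config["learning_rate"],
    per_device_train_batch_size=training_config["batch_size"],
    gradient_accumulation_steps=training_config["gradient_accumulation_steps"],
    num_train_epochs=training_config["num_epochs"],
    warmup_ratio=training_config["warmup_ratio"],
    weight_decay=training_config["weight_decay"],
    max_grad_norm=training_config["max_grad_norm"],
    fp16=training_config["use_amp"],  # mixed-precision training, assumed to match "use_amp"
)
```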
## Architecture Details
### Attention Mechanism

- Type: Grouped Query Attention (GQA); a minimal sketch follows this list
- Query Heads: 12
- Key/Value Heads: 4
- Head Dimension: 64
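For readers unfamiliar with GQA, the sketch below shows how 12 query heads can share 4 key/value heads by repeating K and V across query-head groups. It is a minimal, self-contained PyTorch illustration with this model's dimensions, not the module used in this repository.

```python
# Minimal sketch of grouped query attention (GQA-4); illustrative, names are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_size=768, num_q_heads=12, num_kv_heads=4, head_dim=64):
        super().__init__()
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        self.q_proj = nn.Linear(hidden_size, num_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(num_q_heads * head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of 3 query heads shares one key/value head (12 / 4 = 3).
        k = k.repeat_interleave(self.num_q_heads // self.num_kv_heads, dim=1)
        v = v.repeat_interleave(self.num_q_heads // self.num_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
```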
### Feed Forward Network

- Activation: SwiGLU (a minimal sketch follows this list)
- Hidden Dimension: 3072
- Gating: Yes (SwiGLU includes gating mechanism)
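A SwiGLU block projects the input into the 3072-dimensional hidden space twice (a gate path and an up path), applies SiLU to the gate, multiplies the two, and projects back down. The sketch below is illustrative and not the repository's actual layer.

```python
# Minimal sketch of a SwiGLU feed-forward block with this model's dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated linear unit, as in LLaMA-style FFNs
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```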
### Position Encoding

- Type: RoPE (Rotary Position Embedding); a minimal sketch follows this list
- Base: 10000.0
- Scaling: 1.0
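RoPE encodes positions by rotating pairs of query/key dimensions through position-dependent angles derived from the base frequency (10000.0 here). The helpers below are a minimal, interleaved-pair illustration; the repository's implementation may lay out the rotation differently.

```python
# Minimal sketch of rotary position embeddings (RoPE) with base 10000.0; illustrative only.
import torch

def rope_angles(seq_len, head_dim=64, base=10000.0):
    # One frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate interleaved pairs of dimensions
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```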
## Model Size Comparison
| Component | Parameters | Percentage |
|---|---|---|
| Embeddings | ~38.6M | ~27.1% |
| Transformer Layers | ~103.8M | ~72.9% |
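These numbers can be reproduced from the hyperparameters above under a few assumptions (tied input/output embeddings, bias-free projections, two RMSNorms per layer plus one final RMSNorm); the check below lands exactly on the stated 142,425,600 parameters.

```python
# Rough parameter-count check from the published hyperparameters.
# Assumes tied input/output embeddings, bias-free linear projections, and
# two RMSNorms per layer plus one final RMSNorm; actual bookkeeping may differ.
vocab_size, hidden, layers = 50257, 768, 12
q_heads, kv_heads, head_dim, ffn_hidden = 12, 4, 64, 3072

embeddings = vocab_size * hidden                      # 38,597,376
attention = (hidden * q_heads * head_dim              # Q projection
             + 2 * hidden * kv_heads * head_dim       # K and V projections
             + q_heads * head_dim * hidden)           # output projection
ffn = 3 * hidden * ffn_hidden                         # gate, up, down projections
norms = 2 * hidden                                    # pre-attention + pre-FFN RMSNorm
per_layer = attention + ffn + norms

total = embeddings + layers * per_layer + hidden      # + final RMSNorm
print(f"Embeddings:         {embeddings:,} ({embeddings / total:.1%})")
print(f"Transformer layers: {layers * per_layer:,} ({layers * per_layer / total:.1%})")
print(f"Total:              {total:,}")               # 142,425,600
```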
## Updates
- Initial Upload: Model architecture and configuration
- Training Progress: Will be updated during training on Google Colab
- Final Weights: Trained weights will replace random initialization
Generated with Claude Code
This model is part of the Enhanced Hybrid Transformer series exploring efficient architectures for language modeling.