Enhanced Hybrid Transformer (768D) - Untrained

🚧 This is an UNTRAINED model architecture ready for training!

A modern language model architecture combining LLaMA and Qwen innovations. This repository currently contains the untrained model structure and configuration, and will be updated with trained weights.

Model Architecture

  • Parameters: 142,425,600
  • Architecture: Hybrid Transformer (LLaMA + Qwen inspired)
  • Layers: 12
  • Hidden Size: 768
  • Attention: Grouped Query Attention (GQA-4)
  • Feed Forward: SwiGLU activation
  • Normalization: RMSNorm
  • Position Encoding: RoPE (Rotary Position Embedding)
  • Max Sequence Length: 2048
  • Vocabulary: 50257 tokens (GPT-2 tokenizer)
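For reference, the hyperparameters above can be collected into a single configuration object. The sketch below is illustrative only; the field names are assumptions and may not match the repository's actual config class.

# Illustrative configuration dict mirroring the architecture summary above.
# Field names are assumptions, not necessarily those used by this repo.
model_config = {
    "vocab_size": 50257,              # GPT-2 tokenizer
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,        # query heads
    "num_key_value_heads": 4,         # GQA-4
    "head_dim": 64,                   # 768 / 12
    "intermediate_size": 3072,        # SwiGLU hidden dimension
    "max_position_embeddings": 2048,
    "rope_theta": 10000.0,
    "rms_norm_eps": 1e-6,             # assumed value; not stated in this card
}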

Key Features

  • ✅ Memory Efficient: GQA-4 reduces KV cache memory by 66.7% (see the arithmetic after this list)
  • ✅ Modern Architecture: SwiGLU, RMSNorm, RoPE
  • ✅ Training Ready: Optimized for training stability
  • ✅ Compatible: Standard transformers interface
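The 66.7% figure follows directly from the head counts: the KV cache stores keys and values for 4 heads instead of 12. A quick check (assuming an fp16 cache and batch size 1; these are illustrative assumptions, not repo settings):

# KV cache per layer = 2 (K and V) * seq_len * num_kv_heads * head_dim * bytes_per_value
seq_len, head_dim, bytes_fp16 = 2048, 64, 2

mha_cache = 2 * seq_len * 12 * head_dim * bytes_fp16   # 12 KV heads (standard multi-head attention)
gqa_cache = 2 * seq_len * 4 * head_dim * bytes_fp16    # 4 KV heads (GQA-4)

print(f"KV cache reduction: {1 - gqa_cache / mha_cache:.1%}")  # -> 66.7%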

Training Status

  • 🔄 Currently: Untrained (random weights)
  • 📈 Training: Will be trained on Google Colab
  • 🎯 Dataset: WikiText-2 and custom datasets
  • ⏱️ Updates: Weights will be updated automatically during training

Usage (After Training)

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shivash/enhanced-hybrid-transformer-768d")
model = AutoModelForCausalLM.from_pretrained("shivash/enhanced-hybrid-transformer-768d")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Configuration

# Recommended training settings
training_config = {
    "learning_rate": 5e-5,
    "batch_size": 8,
    "gradient_accumulation_steps": 4,  # Effective batch size: 32
    "num_epochs": 3,
    "warmup_ratio": 0.1,
    "weight_decay": 1e-2,
    "max_grad_norm": 1.0,
    "use_amp": True
}
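For training with the Hugging Face Trainer, the dict above maps roughly onto TrainingArguments as sketched below. This is an assumption-laden sketch (the output path is hypothetical, and dataset/collator setup is omitted), not the project's actual training script:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="enhanced-hybrid-transformer-768d-ckpt",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 32
    num_train_epochs=3,
    warmup_ratio=0.1,
    weight_decay=1e-2,
    max_grad_norm=1.0,
    fp16=True,                       # mixed-precision equivalent of use_amp
)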

Architecture Details

Attention Mechanism

  • Type: Grouped Query Attention (GQA)
  • Query Heads: 12
  • Key/Value Heads: 4
  • Head Dimension: 64
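With 12 query heads and 4 key/value heads, each KV head is shared by 12 / 4 = 3 query heads. A minimal PyTorch sketch of that sharing step (illustrative only, not this repository's implementation):

import torch
import torch.nn.functional as F

batch, seq_len = 1, 16
num_q_heads, num_kv_heads, head_dim = 12, 4, 64

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand each KV head to serve its group of 3 query heads.
k = k.repeat_interleave(num_q_heads // num_kv_heads, dim=1)
v = v.repeat_interleave(num_q_heads // num_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 12, 16, 64])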

Feed Forward Network

  • Activation: SwiGLU
  • Hidden Dimension: 3072
  • Gating: Yes (SwiGLU includes gating mechanism)
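A minimal sketch of a SwiGLU feed-forward block with the dimensions listed above (module and attribute names are illustrative, not necessarily those used in this repo):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated projection multiplied with a linear "up" projection,
        # then projected back to the hidden size.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFeedForward()
print(ffn(torch.randn(1, 8, 768)).shape)  # torch.Size([1, 8, 768])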

Position Encoding

  • Type: RoPE (Rotary Position Embedding)
  • Base: 10000.0
  • Scaling: 1.0
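RoPE derives per-position rotation angles from the base frequency and applies them to query/key pairs. A short sketch of the angle computation with base 10000 (illustrative, not this repo's code):

import torch

head_dim, base, seq_len = 64, 10000.0, 2048

# Standard RoPE inverse frequencies, one per pair of head dimensions.
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
positions = torch.arange(seq_len).float()
angles = torch.outer(positions, inv_freq)   # (seq_len, head_dim // 2)

cos, sin = angles.cos(), angles.sin()       # rotations applied to queries and keys
print(cos.shape)  # torch.Size([2048, 32])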

Model Size Comparison

Component            Parameters    Percentage
Embeddings           ~38,597K      ~27.1%
Transformer Layers   ~103,828K     ~72.9%
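
A quick sanity check of these figures against the parameter count above:

vocab_size, hidden_size, total_params = 50257, 768, 142_425_600

embedding_params = vocab_size * hidden_size        # 38,597,376  (~38,597K)
layer_params = total_params - embedding_params     # 103,828,224 (~103,828K)

print(f"Embeddings:         {embedding_params / total_params:.1%}")  # -> 27.1%
print(f"Transformer layers: {layer_params / total_params:.1%}")      # -> 72.9%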

Updates

  • Initial Upload: Model architecture and configuration
  • Training Progress: Will be updated during training on Google Colab
  • Final Weights: Trained weights will replace random initialization

Generated with Claude Code

This model is part of the Enhanced Hybrid Transformer series exploring efficient architectures for language modeling.
