
shivash/hybrid-transformer-276m-v2

πŸš€ 276M Parameter Hybrid Transformer V2 with GQA-4 Attention

Version 2 Improvements:

  • βœ… Fixed HF transformers compatibility
  • βœ… Proper decoder-only architecture
  • βœ… No more "memory" argument errors
  • βœ… Compatible with HF generation pipeline
  • βœ… Standard causal language model behavior

✨ Key Features

  • 🧠 GQA-4 Attention: 16 query heads share 4 key/value heads, cutting KV-cache memory by 75% versus standard multi-head attention with minimal quality loss (sketched after this list)
  • πŸ“Š Parameters: 276,071,424 (276M)
  • πŸ—οΈ Architecture: Fixed decoder-only design for HF compatibility
  • πŸ“ Context Length: 4K tokens (8K effective with RoPE scaling)
  • ⚑ Efficiency: Optimized for production deployment
  • πŸ”§ HF Compatible: Works with transformers pipeline and generation
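The memory saving follows directly from the head counts: with 16 query heads sharing 4 key/value heads, the KV cache only stores 4/16 of the usual keys and values. Below is a minimal sketch of the idea, not the model's actual attention module; the shapes follow the spec table further down.

import torch
import torch.nn.functional as F

# GQA-4 sketch: 16 query heads, 4 shared key/value heads (head_dim = 1024/16 = 64)
batch, seq = 1, 8
n_heads, n_kv_heads, head_dim = 16, 4, 64

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)  # only 4 KV heads are cached
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each group of 16 / 4 = 4 query heads attends to the same K/V head
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 8, 64])
# KV cache holds 4 of 16 heads' keys/values: 1 - 4/16 = 75% memory saved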

πŸš€ Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (V2 - Fixed compatibility)
model_name = "shivash/hybrid-transformer-276m-v2"
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Use GPT2 tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text (now works!)
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)  # do_sample=True so temperature takes effect
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

πŸ“Š Model Specifications

Specification     Value
----------------  ---------------------------
Parameters        276,071,424
Architecture      Decoder-only (V2 fixed)
Attention Type    GQA-4 (4 key/value groups)
Layers            16
Hidden Size       1024
Attention Heads   16
Vocabulary Size   32,000
Context Length    4,096 tokens
Memory Reduction  75% KV cache vs. MHA

πŸ”§ V2 Architecture Fixes

  • Decoder-Only: Properly marked as is_decoder=True
  • No Encoder: is_encoder_decoder=False
  • Causal Masking: Built-in causal attention masks
  • Self-Attention: No external memory requirements
  • HF Compatible: Works with standard generation methods (the snippet below checks these flags)
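A quick way to confirm these flags on the downloaded checkpoint, assuming the custom config exposes the standard transformers attributes:

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "shivash/hybrid-transformer-276m-v2", trust_remote_code=True
)
print(config.is_decoder)          # expected: True  (decoder-only)
print(config.is_encoder_decoder)  # expected: False (no cross-attention memory)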

⚠️ Note

This is Version 2 with fixed architecture compatibility. The weights are randomly initialized and ready for training on your target dataset.
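Since the checkpoint ships untrained, a training run is needed before generations become meaningful. A minimal fine-tuning sketch with the HF Trainer follows; the dataset (wikitext) and hyperparameters are placeholders, not recommendations from this card:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(
    "shivash/hybrid-transformer-276m-v2", trust_remote_code=True
)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hybrid-276m-v2-ft",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()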

πŸ†• What's New in V2

  • Fixed the "TransformerDecoderLayer.forward() missing 1 required positional argument: 'memory'" error (see the sketch after this list)
  • Compatible with HF transformers generation pipeline
  • Proper causal language model behavior
  • Improved integration with HF ecosystem
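For context on that error: PyTorch's nn.TransformerDecoderLayer.forward(tgt, memory) requires an encoder output ("memory") that a decoder-only LM never produces. One common fix, sketched below, is to build decoder-only blocks from self-attention with a causal mask instead; this is illustrative only, as V2's actual layer implementation lives in the repo's remote code.

import torch
import torch.nn as nn

# A decoder-only block needs no encoder memory: self-attention plus a
# causal mask is enough, e.g. via TransformerEncoderLayer
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
x = torch.randn(1, 8, 1024)
mask = nn.Transformer.generate_square_subsequent_mask(8)
out = layer(x, src_mask=mask, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024])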

πŸ“„ License

Apache 2.0 License

🀝 Contributing

This is V2 of the Hybrid Transformer research project with improved HF compatibility.
