shivash/hybrid-transformer-276m-v2
# 276M Parameter Hybrid Transformer V2 with GQA-4 Attention
## Version 2 Improvements

- Fixed HF transformers compatibility
- Proper decoder-only architecture
- No more "memory" argument errors
- Compatible with the HF generation pipeline
- Standard causal language model behavior
## Key Features

- GQA-4 Attention: 75% KV-cache memory reduction vs. full multi-head attention, with minimal quality loss (see the sketch after this list)
- Parameters: 276,071,424 (276M)
- Architecture: Fixed decoder-only design for HF compatibility
- Context Length: 4K tokens (8K effective with RoPE scaling)
- Efficiency: Optimized for production deployment
- HF Compatible: Works with the transformers pipeline and generation utilities
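The following is a minimal, self-contained sketch of how GQA-4 shares key/value heads among query heads; it is an illustration only, and the attention implementation in this repo may differ. The head counts (16 query heads, 4 KV groups, head dimension 64) follow the specification table below.

```python
import torch

# Illustrative GQA-4 sketch (not the repo's implementation):
# 16 query heads share 4 key/value heads, i.e. 4 query heads per KV group.
num_q_heads, num_kv_heads, head_dim = 16, 4, 64  # 16 * 64 = 1024 hidden size
batch, seq_len = 1, 8

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)  # only 4 KV heads are stored
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand each KV head to serve its group of query heads.
group_size = num_q_heads // num_kv_heads      # 4
k = k.repeat_interleave(group_size, dim=1)    # (1, 16, 8, 64)
v = v.repeat_interleave(group_size, dim=1)

# Standard causal scaled-dot-product attention over the expanded heads.
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 8, 64])
```

Because only 4 of the 16 heads need cached keys and values, the KV cache shrinks by 75%, which is where the memory figure in the specification table comes from.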
## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (V2 - fixed compatibility)
model_name = "shivash/hybrid-transformer-276m-v2"
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # uses the GPT-2 tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text (do_sample=True enables sampling so temperature takes effect)
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
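Since the card states that V2 works with the HF generation pipeline, here is a minimal sketch using `transformers.pipeline` with the model and tokenizer loaded above; the sampling parameters are illustrative, not tuned values.

```python
from transformers import pipeline

# Wrap the already-loaded model and tokenizer in a text-generation pipeline.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = generator(
    "The future of artificial intelligence is",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
)
print(result[0]["generated_text"])
```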
## Model Specifications

| Specification | Value |
|---|---|
| Parameters | 276,071,424 |
| Architecture | Decoder-only (V2 Fixed) |
| Attention Type | GQA-4 (4 groups) |
| Layers | 16 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Vocabulary Size | 32,000 |
| Context Length | 4,096 tokens |
| Memory Reduction | 75% vs. MHA |
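A back-of-the-envelope check on the memory figure, using only the numbers from the table above. It assumes an fp16 KV cache and head dimension = hidden size / attention heads; actual savings depend on the runtime and cache implementation.

```python
# KV-cache size: MHA (16 KV heads) vs. GQA-4 (4 shared KV heads), fp16.
layers, hidden, heads, kv_groups, ctx = 16, 1024, 16, 4, 4096
head_dim = hidden // heads  # 64
bytes_per_value = 2         # fp16

def kv_cache_bytes(num_kv_heads: int) -> int:
    # keys + values, per layer, per cached position
    return 2 * layers * ctx * num_kv_heads * head_dim * bytes_per_value

mha = kv_cache_bytes(heads)      # full multi-head attention
gqa = kv_cache_bytes(kv_groups)  # GQA-4
print(f"MHA KV cache:   {mha / 2**20:.0f} MiB")  # 256 MiB
print(f"GQA-4 KV cache: {gqa / 2**20:.0f} MiB")  # 64 MiB
print(f"Reduction:      {1 - gqa / mha:.0%}")    # 75%
```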
## V2 Architecture Fixes

- Decoder-Only: properly marked with `is_decoder=True`
- No Encoder: `is_encoder_decoder=False`
- Causal Masking: built-in causal attention masks
- Self-Attention: no cross-attention `memory` input required
- HF Compatible: works with standard generation methods (the flags can be verified with the config check after this list)
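If you want to verify the decoder-only flags described above, the published config can be inspected directly. This is a minimal sketch and assumes the remote config exposes the standard HF `is_decoder` / `is_encoder_decoder` attributes.

```python
from transformers import AutoConfig

# Inspect the decoder-only flags on the published config.
config = AutoConfig.from_pretrained(
    "shivash/hybrid-transformer-276m-v2", trust_remote_code=True
)
print(config.is_decoder)          # expected: True
print(config.is_encoder_decoder)  # expected: False
```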
## Note

This is Version 2, with the architecture compatibility fixes listed above. The weights are randomly initialized and are meant to be trained on your target dataset; a minimal training sketch follows.
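The sketch below shows one way to train the randomly initialized weights as a causal LM with the standard HF `Trainer`. The dataset (`wikitext-2-raw-v1`), sequence length, and batch size are placeholders, not recommendations; swap in your own corpus and hyperparameters.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder corpus and hyperparameters; replace with your target dataset.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    "shivash/hybrid-transformer-276m-v2", trust_remote_code=True
)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=dataset.column_names,
).filter(lambda example: len(example["input_ids"]) > 0)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="hybrid-276m-v2-pretrain",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```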
## What's New in V2

- Fixed the `TransformerDecoderLayer.forward()` missing `memory` argument error
- Compatible with the HF transformers generation pipeline
- Proper causal language model behavior
- Improved integration with the HF ecosystem
## License

Apache License 2.0
## Contributing

This is V2 of the Hybrid Transformer research project, with improved HF compatibility.