i3-80M - Hybrid Architecture Language Model
Model Description
i3-80M is a language model with a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
This is the second model in the i3 series, scaling up from the original i3-22M with improved architecture and multi-dataset training.
Model Statistics
- Total Parameters: ~82.77M (82,765,160)
- Architecture: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- Vocabulary Size: 35,560 tokens (variable-length chunks plus a dedicated unknown token)
- Hidden Dimension (d_model): 512
- Attention Heads: 16
- State Dimension (d_state): 32
- Max Sequence Length: 256
- Tokenization: Memory-efficient variable-length chunking (2-3 characters)
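For orientation, these hyperparameters can be collected into a single config sketch; the field names below are illustrative and do not necessarily match the keys in the repo's config.json:

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    """Illustrative hyperparameter bundle for i3-80M; field names are hypothetical."""
    vocab_size: int = 35_560        # variable-length chunk vocabulary
    d_model: int = 512              # hidden dimension
    n_heads: int = 16               # attention heads in the upper layers
    d_state: int = 32               # state dimension of the hybrid blocks
    n_hybrid_layers: int = 10       # RWKV-Mamba hybrid blocks
    n_attention_layers: int = 6     # full multi-head attention blocks
    max_seq_len: int = 256          # context window
    ffn_mult: int = 4               # feed-forward expansion factor
```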
Architecture Breakdown
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
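A minimal PyTorch sketch of this layout is shown below. The RWKVMambaHybrid mixer is stubbed out, and normalization, causal masking, and the exact residual wiring are assumptions, so only the layer ordering, head count, and 4x feed-forward expansion follow the description above:

```python
import torch.nn as nn

def ffn(d_model: int, mult: int = 4) -> nn.Sequential:
    """Feed-forward network with 4x expansion, shared by both block types."""
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),
        nn.GELU(),
        nn.Linear(mult * d_model, d_model),
    )

class HybridBlock(nn.Module):
    """Layers 1-10: placeholder for the RWKV-Mamba time-mixing/state-space block."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.mixer = nn.Identity()  # stand-in for RWKVMambaHybrid(d_model, d_state)
        self.ff = ffn(d_model)

    def forward(self, x):
        h = x + self.mixer(x)       # time-mixing / state-space residual
        return h + self.ff(h)       # feed-forward residual

class AttentionBlock(nn.Module):
    """Layers 11-16: full multi-head self-attention block."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = ffn(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = x + attn_out            # attention residual
        return h + self.ff(h)       # feed-forward residual

# 10 hybrid blocks followed by 6 attention blocks (d_model=512, 16 heads, d_state=32).
layers = nn.ModuleList(
    [HybridBlock(512, 32) for _ in range(10)] +
    [AttentionBlock(512, 16) for _ in range(6)]
)
```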
Comparison with i3-22M
| Feature | i3-22M | i3-80M (This Model) |
|---|---|---|
| Parameters | 22.6M | 82.77M |
| Architecture | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| Hidden Dimension | 512 | 512 |
| Vocabulary Size | 4,466 | 35,560 |
| Training Dataset | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| Training Data | ~1M conversations | 3,000,000+ tokens |
| Final Loss | ~2.0 | ~2.0 |
| Final Perplexity | 7.29-9.70 | 7.29-10.0 |
| Training Time | ~17 hours | ~2-4 hours |
| Attention Layers | None (Pure Hybrid) | 6 Full Attention Layers |
Key Improvements Over i3-22M
- Hybrid Architecture: Introduces full multi-head attention in upper layers for better long-range dependencies
- Larger Vocabulary: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
- Multi-Dataset Training: Trained on 3 diverse datasets vs single dataset
- Better Generalization: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
- Enhanced Unknown Token Handling: A dedicated unknown token provides robust handling of out-of-vocabulary text
When to Use Each Model
Use i3-22M if you need:
- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference
Use i3-80M if you need:
- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)
Key Features
Hybrid Architecture: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
Memory-Optimized Training:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
Multi-Dataset Pre-training: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
Smart Tokenization: Variable-length chunking (2-3 chars) with common trigram optimization (a toy sketch follows this list)
- Total tokens processed: 3,000,000+
- Handles out-of-vocabulary text gracefully via a dedicated unknown token
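A toy greedy chunker in this spirit is sketched below; the actual vocabulary construction and trigram selection in the training code are more involved, and the function and variable names here are illustrative:

```python
def chunk_encode(text: str, vocab: dict, unk_id: int) -> list:
    """Greedy variable-length chunking: prefer known 3-char chunks, fall back to
    2-char chunks, then to a single unknown token. Toy illustration only."""
    ids, i = [], 0
    while i < len(text):
        for size in (3, 2):                 # common trigrams first, then bigrams
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in vocab:
                ids.append(vocab[chunk])
                i += size
                break
        else:                               # no known chunk starts here
            ids.append(unk_id)              # out-of-vocabulary fallback
            i += 1
    return ids

# Example with a tiny hand-made vocabulary:
vocab = {"hel": 0, "lo ": 1, "wo": 2, "rl": 3, "d!": 4}
print(chunk_encode("hello world!", vocab, unk_id=99))  # [0, 1, 2, 3, 4]
```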
Training Details
Training Configuration
- Datasets:
  - agentlans/high-quality-english-sentences
  - roneneldan/TinyStories
  - starhopp3r/TinyChat
- Training Steps: 5,000 iterations
- Batch Size: 4 (with gradient accumulation support)
- Learning Rate: 3e-4 (with warmup and cosine decay)
- Optimizer: AdamW with gradient clipping (max norm: 1.0)
- Hardware: NVIDIA GeForce RTX 3060 (12GB VRAM)
- Training Time: ~2-4 hours
- Framework: PyTorch
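The optimizer settings above (AdamW, peak learning rate 3e-4 with warmup then cosine decay, gradient clipping at max norm 1.0) correspond roughly to the sketch below; the warmup length and the exact schedule shape are assumptions:

```python
import math
import torch

def lr_at(step: int, max_steps: int = 5_000, peak_lr: float = 3e-4,
          warmup_steps: int = 200) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (warmup length assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def train_step(model, optimizer, batch, step):
    """One step with manual LR scheduling and gradient clipping at max norm 1.0."""
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.zero_grad()
    loss = model(**batch).loss  # assumes an HF-style output object with a .loss field
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```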
Training Dynamics
- GPU Utilization: Stable at ~15-20% during training
- GPU Memory: ~18% allocated (~2.2GB / 12GB)
- Power Usage: ~40W average
- Throughput: ~100-550 tokens/sec
Performance Metrics
| Metric | Initial | Final | Best |
|---|---|---|---|
| Training Loss | ~6.0 | ~2.0 | 1.98 |
| Perplexity | ~400+ | ~7-10 | 7.29 |
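These rows are mutually consistent with perplexity being the exponential of the training loss: exp(6.0) ≈ 403 matches the ~400+ starting perplexity, and exp(1.98) ≈ 7.2 is close to the 7.29 best value.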
I don't know why the logging starts at step 4.6k.
(Figure: i3-22M vs. i3-80M training comparison.)
The model shows strong convergence with stable training dynamics and efficient GPU utilization.
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")
# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,  # temperature and top_k only take effect when sampling
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0])
print(generated_text)
For custom usage with the original training code, check user.py.
Technical Innovations
RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
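Conceptually, such a block carries a per-channel recurrent state forward one token at a time instead of attending over all previous tokens. The heavily simplified update below (not the model's actual equations) shows why the cost is linear in sequence length:

```python
import torch

def toy_linear_recurrence(x: torch.Tensor, decay: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the hybrid block's recurrence: an exponentially decaying
    running state, updated once per token (O(T) in sequence length, not O(T^2)).
    x: (batch, seq_len, d_model); decay: (d_model,) with values in (0, 1)."""
    batch, seq_len, d_model = x.shape
    state = torch.zeros(batch, d_model)
    outputs = []
    for t in range(seq_len):                        # one state update per time step
        state = decay * state + (1.0 - decay) * x[:, t]
        outputs.append(state)
    return torch.stack(outputs, dim=1)

# Cost grows linearly with sequence length, unlike full attention.
y = toy_linear_recurrence(torch.randn(2, 256, 512), torch.rand(512) * 0.9 + 0.05)
```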
Hierarchical Processing:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
Memory Efficiency:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
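The streaming vocabulary building described above can be approximated by a frequency counter that never materializes the full corpus. The sketch below is an assumption about the approach, not the repository's actual code, and the dataset-streaming usage at the end is hypothetical:

```python
from collections import Counter

def build_chunk_vocab(text_iter, vocab_size: int = 35_560, chunk_len: int = 3) -> dict:
    """Count fixed-length character chunks over a streamed text iterator and keep the
    most frequent ones. Simplified: the real tokenizer mixes 2- and 3-char chunks."""
    counts = Counter()
    for text in text_iter:  # one document at a time; the corpus is never held in RAM
        counts.update(text[i:i + chunk_len] for i in range(0, len(text), chunk_len))
    vocab = {"<unk>": 0}    # reserve id 0 for the unknown token (name chosen here)
    for chunk, _ in counts.most_common(vocab_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

# Hypothetical usage with a streamed Hugging Face dataset:
# from datasets import load_dataset
# stream = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
# vocab = build_chunk_vocab(ex["text"] for ex in stream.take(10_000))
```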
Model Files
- pytorch_model.bin: Model weights
- config.json: Model configuration
- chunk_vocab_combined.json: Tokenizer vocabulary
Training Tracking
This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
Model Series
- i3-22M - Original model with pure hybrid architecture
- i3-80M (This model) - Scaled version with attention layers and multi-dataset training
Citation
@misc{i3-80m,
author = {FlameF0X},
title = {i3-80M: Hybrid Architecture Language Model},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}