i3-80M - Hybrid Architecture Language Model
Model Description
i3-80M is a language model with a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
This is the second model in the i3 series, scaling up from the original i3-22M with improved architecture and multi-dataset training.
Model Statistics
- Total Parameters: ~82.77M (82,765,160)
- Architecture: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- Vocabulary Size: 35,560 tokens (variable-length chunks plus a dedicated unknown token)
- Hidden Dimension (d_model): 512
- Attention Heads: 16
- State Dimension (d_state): 32
- Max Sequence Length: 256
- Tokenization: Memory-efficient variable-length chunking (2-3 characters)
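For orientation, these hyperparameters can be collected into a single config sketch; the field names below are illustrative and do not necessarily match the keys in the repo's config.json:

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    """Illustrative hyperparameter bundle for i3-80M; field names are hypothetical."""
    vocab_size: int = 35_560        # variable-length chunk vocabulary
    d_model: int = 512              # hidden dimension
    n_heads: int = 16               # attention heads in the upper layers
    d_state: int = 32               # state dimension of the hybrid blocks
    n_hybrid_layers: int = 10       # RWKV-Mamba hybrid blocks
    n_attention_layers: int = 6     # full multi-head attention blocks
    max_seq_len: int = 256          # context window
    ffn_mult: int = 4               # feed-forward expansion factor
```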
Architecture Breakdown
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
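A minimal PyTorch sketch of this layout is shown below. The RWKVMambaHybrid mixer is stubbed out, and normalization, causal masking, and the exact residual wiring are assumptions, so only the layer ordering, head count, and 4x feed-forward expansion follow the description above:

```python
import torch.nn as nn

def ffn(d_model: int, mult: int = 4) -> nn.Sequential:
    """Feed-forward network with 4x expansion, shared by both block types."""
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),
        nn.GELU(),
        nn.Linear(mult * d_model, d_model),
    )

class HybridBlock(nn.Module):
    """Layers 1-10: placeholder for the RWKV-Mamba time-mixing/state-space block."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.mixer = nn.Identity()  # stand-in for RWKVMambaHybrid(d_model, d_state)
        self.ff = ffn(d_model)

    def forward(self, x):
        h = x + self.mixer(x)       # time-mixing / state-space residual
        return h + self.ff(h)       # feed-forward residual

class AttentionBlock(nn.Module):
    """Layers 11-16: full multi-head self-attention block."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = ffn(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = x + attn_out            # attention residual
        return h + self.ff(h)       # feed-forward residual

# 10 hybrid blocks followed by 6 attention blocks (d_model=512, 16 heads, d_state=32).
layers = nn.ModuleList(
    [HybridBlock(512, 32) for _ in range(10)] +
    [AttentionBlock(512, 16) for _ in range(6)]
)
```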
Comparison with i3-22M
| Feature | i3-22M | i3-80M (This Model) |
|---|---|---|
| Parameters | 22.6M | 82.77M |
| Architecture | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| Hidden Dimension | 512 | 512 |
| Vocabulary Size | 4,466 | 35,560 |
| Training Dataset | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| Training Data | ~1M conversations | 3,000,000+ tokens |
| Final Loss | ~2.0 | ~2.0 |
| Final Perplexity | 7.29-9.70 | 7.29-10.0 |
| Training Time | ~17 hours | ~2-4 hours |
| Attention Layers | None (Pure Hybrid) | 6 Full Attention Layers |
Key Improvements Over i3-22M
- Hybrid Architecture: Introduces full multi-head attention in upper layers for better long-range dependencies
- Larger Vocabulary: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
- Multi-Dataset Training: Trained on 3 diverse datasets vs single dataset
- Better Generalization: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
- Enhanced Unknown Token Handling: A dedicated unknown token provides robust handling of out-of-vocabulary text
When to Use Each Model
Use i3-22M if you need:
- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference
Use i3-80M if you need:
- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)
Key Features
Hybrid Architecture: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
Memory-Optimized Training:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
Multi-Dataset Pre-training: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
Smart Tokenization: Variable-length chunking (2-3 chars) with common trigram optimization (a toy sketch follows this list)
- Total tokens processed: 3,000,000+
- Handles out-of-vocabulary text gracefully via a dedicated unknown token
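A toy greedy chunker in this spirit is sketched below; the actual vocabulary construction and trigram selection in the training code are more involved, and the function and variable names here are illustrative:

```python
def chunk_encode(text: str, vocab: dict, unk_id: int) -> list:
    """Greedy variable-length chunking: prefer known 3-char chunks, fall back to
    2-char chunks, then to a single unknown token. Toy illustration only."""
    ids, i = [], 0
    while i < len(text):
        for size in (3, 2):                 # common trigrams first, then bigrams
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in vocab:
                ids.append(vocab[chunk])
                i += size
                break
        else:                               # no known chunk starts here
            ids.append(unk_id)              # out-of-vocabulary fallback
            i += 1
    return ids

# Example with a tiny hand-made vocabulary:
vocab = {"hel": 0, "lo ": 1, "wo": 2, "rl": 3, "d!": 4}
print(chunk_encode("hello world!", vocab, unk_id=99))  # [0, 1, 2, 3, 4]
```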
Training Details
Training Configuration
- Datasets:
  - agentlans/high-quality-english-sentences
  - roneneldan/TinyStories
  - starhopp3r/TinyChat
- Training Steps: 5,000 iterations
- Batch Size: 4 (with gradient accumulation support)
- Learning Rate: 3e-4 (with warmup and cosine decay)
- Optimizer: AdamW with gradient clipping (max norm: 1.0)
- Hardware: NVIDIA GeForce RTX 3060 (12GB VRAM)
- Training Time: ~2-4 hours
- Framework: PyTorch
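The optimizer settings above (AdamW, peak learning rate 3e-4 with warmup then cosine decay, gradient clipping at max norm 1.0) correspond roughly to the sketch below; the warmup length and the exact schedule shape are assumptions:

```python
import math
import torch

def lr_at(step: int, max_steps: int = 5_000, peak_lr: float = 3e-4,
          warmup_steps: int = 200) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (warmup length assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def train_step(model, optimizer, batch, step):
    """One step with manual LR scheduling and gradient clipping at max norm 1.0."""
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.zero_grad()
    loss = model(**batch).loss  # assumes an HF-style output object with a .loss field
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```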
Training Dynamics
- GPU Utilization: Stable at ~15-20% during training
- GPU Memory: ~18% allocated (~2.2GB / 12GB)
- Power Usage: ~40W average
- Throughput: ~100-550 tokens/sec
Performance Metrics
| Metric | Initial | Final | Best |
|---|---|---|---|
| Training Loss | ~6.0 | ~2.0 | 1.98 |
| Perplexity | ~400+ | ~7-10 | 7.29 |
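These rows are mutually consistent with perplexity being the exponential of the training loss: exp(6.0) ≈ 403 matches the ~400+ starting perplexity, and exp(1.98) ≈ 7.2 is close to the 7.29 best value.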
I don't know why the logging starts at step 4.6k.
(Figure: i3-22M vs. i3-80M training comparison.)
The model shows strong convergence with stable training dynamics and efficient GPU utilization.
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")
# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,  # temperature and top_k only take effect when sampling
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0])
print(generated_text)
For custom usage with the original training code, check user.py.
Technical Innovations
RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
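Conceptually, such a block carries a per-channel recurrent state forward one token at a time instead of attending over all previous tokens. The heavily simplified update below (not the model's actual equations) shows why the cost is linear in sequence length:

```python
import torch

def toy_linear_recurrence(x: torch.Tensor, decay: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the hybrid block's recurrence: an exponentially decaying
    running state, updated once per token (O(T) in sequence length, not O(T^2)).
    x: (batch, seq_len, d_model); decay: (d_model,) with values in (0, 1)."""
    batch, seq_len, d_model = x.shape
    state = torch.zeros(batch, d_model)
    outputs = []
    for t in range(seq_len):                        # one state update per time step
        state = decay * state + (1.0 - decay) * x[:, t]
        outputs.append(state)
    return torch.stack(outputs, dim=1)

# Cost grows linearly with sequence length, unlike full attention.
y = toy_linear_recurrence(torch.randn(2, 256, 512), torch.rand(512) * 0.9 + 0.05)
```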
Hierarchical Processing:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
Memory Efficiency:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
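The streaming vocabulary building described above can be approximated by a frequency counter that never materializes the full corpus. The sketch below is an assumption about the approach, not the repository's actual code, and the dataset-streaming usage at the end is hypothetical:

```python
from collections import Counter

def build_chunk_vocab(text_iter, vocab_size: int = 35_560, chunk_len: int = 3) -> dict:
    """Count fixed-length character chunks over a streamed text iterator and keep the
    most frequent ones. Simplified: the real tokenizer mixes 2- and 3-char chunks."""
    counts = Counter()
    for text in text_iter:  # one document at a time; the corpus is never held in RAM
        counts.update(text[i:i + chunk_len] for i in range(0, len(text), chunk_len))
    vocab = {"<unk>": 0}    # reserve id 0 for the unknown token (name chosen here)
    for chunk, _ in counts.most_common(vocab_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

# Hypothetical usage with a streamed Hugging Face dataset:
# from datasets import load_dataset
# stream = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
# vocab = build_chunk_vocab(ex["text"] for ex in stream.take(10_000))
```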
Model Files
- pytorch_model.bin: Model weights
- config.json: Model configuration
- chunk_vocab_combined.json: Tokenizer vocabulary
Training Tracking
This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
Model Series
- i3-22M - Original model with pure hybrid architecture
- i3-80M (This model) - Scaled version with attention layers and multi-dataset training
Citation
@misc{i3-80m,
author = {FlameF0X},
title = {i3-80M: Hybrid Architecture Language Model},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}