Tags: Text Generation · Transformers · PyTorch · Safetensors · English · i3 · conversational · i3-architecture · hybrid-model · rwkv-mamba

i3-80M - Hybrid Architecture Language Model

Model Description

i3-80M is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The architecture blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the deeper layers.

This is the second model in the i3 series, scaling up from the original i3-22M with improved architecture and multi-dataset training.

Model Statistics

  • Total Parameters: ~82.77M (82,765,160)
  • Architecture: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
  • Vocabulary Size: 35,560 tokens (variable-length chunks plus an unknown-token fallback)
  • Hidden Dimension (d_model): 512
  • Attention Heads: 16
  • State Dimension (d_state): 32
  • Max Sequence Length: 256
  • Tokenization: Memory-efficient variable-length chunking (2-3 characters)
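
For reference, the statistics above correspond to a configuration along the following lines. This is a sketch only; the field names are illustrative and may not match the keys actually used in config.json.

from dataclasses import dataclass

@dataclass
class I3Config:
    # Illustrative field names; the real config.json keys may differ.
    vocab_size: int = 35_560       # variable-length chunk vocabulary incl. unknown-token fallback
    d_model: int = 512             # hidden dimension
    n_heads: int = 16              # attention heads in the upper blocks
    d_state: int = 32              # state dimension of the hybrid recurrence
    n_hybrid_layers: int = 10      # RWKV-Mamba hybrid blocks (layers 1-10)
    n_attention_layers: int = 6    # full multi-head attention blocks (layers 11-16)
    max_seq_len: int = 256         # maximum context window
    ffn_expansion: int = 4         # feed-forward expansion factor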

Architecture Breakdown

Layers 1-10:  RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
              ├─ RWKVMambaHybrid (Time-mixing + State-space)
              └─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
              ├─ Multi-Head Attention (16 heads)
              └─ Feed-Forward Network (4x expansion)
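
A minimal PyTorch sketch of this layout, for illustration only: the hybrid block here is a simplified stand-in (a causal depthwise convolution plus FFN) rather than the actual RWKVMambaHybrid layer, and nn.TransformerEncoderLayer stands in for the attention blocks. It shows the 10 + 6 stacking, not the real implementation.

import torch
import torch.nn as nn

class HybridBlockStub(nn.Module):
    """Stand-in for the RWKV-Mamba hybrid block (a recurrence sketch appears under
    Technical Innovations); here a causal depthwise conv + 4x FFN so the layout runs."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.ffn = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):                                    # x: (batch, seq, d_model)
        h = self.conv(self.norm(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h                                            # causal token mixing
        return x + self.ffn(x)                               # 4x feed-forward sub-layer

class I3LayoutSketch(nn.Module):
    """10 hybrid blocks followed by 6 full-attention blocks, as in the breakdown above."""
    def __init__(self, vocab_size=35_560, d_model=512, n_heads=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        attn = lambda: nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                                  batch_first=True, norm_first=True)
        self.blocks = nn.ModuleList([HybridBlockStub(d_model) for _ in range(10)] +
                                    [attn() for _ in range(6)])
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):                            # input_ids: (batch, seq)
        x = self.embed(input_ids)
        T = input_ids.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        for block in self.blocks:
            x = block(x) if isinstance(block, HybridBlockStub) else block(x, src_mask=causal)
        return self.lm_head(self.norm(x))                    # logits over the chunk vocabulary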

Comparison with i3-22M

Feature             i3-22M                i3-80M (This Model)
Parameters          22.6M                 82.77M
Architecture        24 Hybrid Layers      10 Hybrid + 6 Attention Layers
Hidden Dimension    512                   512
Vocabulary Size     4,466                 35,560
Training Dataset    TinyChat only         TinyStories + TinyChat + HQ Sentences
Total Tokens        ~1M conversations     3,000,000+ tokens
Final Loss          ~2.0                  ~2.0
Final Perplexity    7.29-9.70             7.29-10.0
Training Time       ~17 hours             ~2-4 hours
Attention Layers    None (Pure Hybrid)    6 Full Attention Layers

Key Improvements Over i3-22M

  1. Hybrid Architecture: Introduces full multi-head attention in upper layers for better long-range dependencies
  2. Larger Vocabulary: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
  3. Multi-Dataset Training: Trained on 3 diverse datasets vs single dataset
  4. Better Generalization: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
  5. Enhanced Unknown Token Handling: Robust fallback token for out-of-vocabulary words

When to Use Each Model

Use i3-22M if you need:

  • Smaller model size (~22M params)
  • Pure conversational focus (TinyChat specialized)
  • Lower memory footprint
  • Faster inference

Use i3-80M if you need:

  • Better general-purpose text generation
  • Stronger attention-based reasoning (6 attention layers)
  • Larger vocabulary coverage
  • Multi-domain text understanding (stories, chat, formal text)

Key Features

  1. Hybrid Architecture: Combines the efficiency of recurrent/convolutional processing with the power of attention

    • Early layers use RWKV-Mamba hybrid for efficient sequence processing
    • Later layers use full multi-head attention for complex pattern recognition
  2. Memory-Optimized Training:

    • Streaming vocabulary building (no full text storage)
    • Vocabulary caching (build once, reuse)
    • Efficient chunk frequency counting
    • Automatic memory cleanup
  3. Multi-Dataset Pre-training: Trained on diverse text sources for robust language understanding

    • TinyStories: Narrative and storytelling
    • TinyChat: Conversational dynamics
    • High-Quality English Sentences: Linguistic diversity
  4. Smart Tokenization: Variable-length chunking (2-3 chars) with common trigram optimization

    • Total tokens processed: 3,000,000+
    • Handles out-of-vocabulary text gracefully with an unknown-token fallback
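
As a rough illustration of points 2 and 4 above, a streaming chunk-vocabulary build and a greedy 2-3 character encoder could look like the sketch below. The chunking rule, special-token name, and trigram list are assumptions made for the example; the real logic lives in the training code and chunk_vocab_combined.json.

from collections import Counter

UNK = "<unk>"  # illustrative name for the unknown-token fallback

def iter_chunks(text, common_trigrams):
    """Greedy variable-length chunking: prefer a known 3-char chunk, else fall back to 2 chars."""
    i = 0
    while i < len(text):
        tri = text[i:i + 3]
        if len(tri) == 3 and tri in common_trigrams:
            yield tri
            i += 3
        else:
            yield text[i:i + 2]
            i += 2

def build_vocab(line_iterator, common_trigrams, max_size=35_560):
    """Stream over the corpus once, counting chunk frequencies without storing the full text."""
    counts = Counter()
    for line in line_iterator:                 # e.g. lines streamed from the datasets
        counts.update(iter_chunks(line, common_trigrams))
    most_common = [chunk for chunk, _ in counts.most_common(max_size - 1)]
    return {chunk: idx for idx, chunk in enumerate([UNK] + most_common)}

def encode(text, vocab, common_trigrams):
    """Map text to chunk ids, falling back to the unknown token for unseen chunks."""
    return [vocab.get(chunk, vocab[UNK]) for chunk in iter_chunks(text, common_trigrams)]

# Tiny usage example
trigrams = {"the", "ing", "and"}
vocab = build_vocab(["the cat and the dog sing"], trigrams)
print(encode("the frog", vocab, trigrams))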

Training Details

Training Configuration

  • Datasets:
    • agentlans/high-quality-english-sentences
    • roneneldan/TinyStories
    • starhopp3r/TinyChat
  • Training Steps: 5,000 iterations
  • Batch Size: 4 (with gradient accumulation support)
  • Learning Rate: 3e-4 (with warmup and cosine decay)
  • Optimizer: AdamW with gradient clipping (max norm: 1.0)
  • Hardware: NVIDIA GeForce RTX 3060 (12GB VRAM)
  • Training Time: ~2-4 hours
  • Framework: PyTorch
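
A hedged sketch of the optimizer, schedule, and gradient clipping described above. The warmup length and minimum learning rate are assumptions, not values taken from the actual training script.

import math
import torch

def make_optimizer_and_scheduler(model, total_steps=5_000, warmup_steps=200,
                                 peak_lr=3e-4, min_lr=3e-5):
    """AdamW with linear warmup into cosine decay (warmup_steps and min_lr are assumed)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                       # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr    # decay from peak_lr to min_lr

    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def train_step(model, input_ids, labels, loss_fn, optimizer, scheduler):
    """One optimization step with gradient clipping at max norm 1.0."""
    logits = model(input_ids)                                      # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()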

Training Dynamics

  • GPU Utilization: Stable at ~15-20% during training
  • GPU Memory: ~18% allocated (~2.2GB / 12GB)
  • Power Usage: ~40W average
  • Throughput: ~100-550 tokens/sec

Performance Metrics

Metric           Initial    Final     Best
Training Loss    ~6.0       ~2.0      1.98
Perplexity       ~400+      ~7-10     7.29
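
The loss and perplexity columns are consistent with perplexity being the exponential of the cross-entropy loss:

import math

print(math.exp(2.0))    # ~7.39, in line with the reported final perplexity of ~7-10
print(math.exp(1.98))   # ~7.24, close to the reported best perplexity of 7.29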

[Figure: WandB training curves (loss and perplexity). I don't know why the logging starts at step 4.6k.]

[Figure: Training comparison of i3-22M and i3-80M.]

The model shows strong convergence with stable training dynamics and efficient GPU utilization.

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (the custom i3 architecture may require trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")

# Generate text (do_sample=True is required for temperature/top_k to take effect)
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_k=40
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

For custom usage with the original training code, check user.py.

Technical Innovations

  1. RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics

    • Linear complexity for long sequences
    • Efficient recurrent processing
    • State-space modeling for temporal dependencies
  2. Hierarchical Processing:

    • Lower layers focus on local patterns (conv/recurrent)
    • Upper layers capture global dependencies (attention)
  3. Memory Efficiency:

    • Streaming tokenization during vocab building
    • No full dataset storage in RAM
    • Automatic cleanup of intermediate data
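
An illustrative (not faithful) version of such a recurrence: RWKV-style time-mixing blends each token with its predecessor, and a diagonal state-space update h_t = a * h_{t-1} + u_t carries information forward in linear time over the sequence. The real RWKVMambaHybrid layer adds gating and input-dependent dynamics beyond this sketch.

import torch
import torch.nn as nn

class HybridRecurrenceSketch(nn.Module):
    """Illustrative only: token-shift time-mixing feeding a per-channel state-space update."""
    def __init__(self, d_model=512, d_state=32):
        super().__init__()
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))      # time-mix ratio per channel
        self.in_proj = nn.Linear(d_model, d_state)
        self.a = nn.Parameter(torch.rand(d_state) * 0.9)          # decay of each state channel
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):                                         # x: (batch, seq, d_model)
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        mixed = self.mix * x + (1 - self.mix) * prev              # RWKV-style token shift
        u = self.in_proj(mixed)                                   # (batch, seq, d_state)
        h = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outs = []
        for t in range(u.size(1)):                                # linear in sequence length
            h = self.a * h + u[:, t]                              # state-space update
            outs.append(h)
        return x + self.out_proj(torch.stack(outs, dim=1))        # residual connection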

Model Files

  • pytorch_model.bin: Model weights
  • config.json: Model configuration
  • chunk_vocab_combined.json: Tokenizer vocabulary
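
For example, the tokenizer vocabulary can be inspected directly (assuming chunk_vocab_combined.json stores a chunk-to-id mapping; the exact layout is not documented here):

import json

with open("chunk_vocab_combined.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab))   # expected to be close to the 35,560 tokens reported above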

Training Tracking

This model was tracked using Weights & Biases (WandB) with comprehensive metrics:

  • Real-time loss and perplexity tracking
  • Gradient norm monitoring
  • Learning rate scheduling visualization
  • Generation samples logged to tables
  • Model checkpoints as artifacts
  • System resource monitoring
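
A minimal sketch of logging along these lines (the project name, metric keys, and loss values are made up; the actual run configuration may differ):

import math
import wandb

run = wandb.init(project="i3-80m", config={"d_model": 512, "n_layers": 16, "peak_lr": 3e-4})

for step, loss in enumerate([6.0, 4.1, 2.9, 2.3, 2.0]):           # stand-in loss values
    run.log({"train/loss": loss,
             "train/perplexity": math.exp(loss)}, step=step)      # perplexity = exp(loss)

run.finish()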

Limitations

  • Trained on English text only
  • Limited to 256 token context window
  • May require fine-tuning for specific downstream tasks
  • Conversational style influenced by TinyChat dataset

Model Series

  • i3-22M - Original model with pure hybrid architecture
  • i3-80M (This model) - Scaled version with attention layers and multi-dataset training

Citation

@misc{i3-80m,
  author = {FlameF0X},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}