Upload folder using huggingface_hub
- .gitignore +7 -0
- LICENSE +21 -0
- README.md +170 -3
- config.py +9 -0
- models/__init__.py +0 -0
- models/model.py +113 -0
- scripts/generate.py +183 -0
- scripts/memory.py +42 -0
- scripts/prepare_data.py +26 -0
- scripts/tokenizer_setup.py +33 -0
- scripts/train.py +133 -0
.gitignore
ADDED
@@ -0,0 +1,7 @@
+data/*
+*.pt
+*.json
+.idea
+__pycache__
+venv
+memory.db
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Brett Moore
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,3 +1,170 @@
-
-
-
+# Microformer
+
+**Microformer** is a minimal, educational-scale transformer language model built from scratch in PyTorch.
+Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and OpenAI’s GPT-1, Microformer is designed for learning, experimentation, and prototyping on lightweight datasets such as [text8](https://mattmahoney.net/dc/textdata.html) or Tiny Shakespeare.
+
+---
+
+## Features
+
+- Decoder-only transformer (GPT-style) architecture
+- **Stacked adapters per layer for dual memory:**
+  - **Long-term adapters** (for corpus/knowledge facts)
+  - **Session adapters** (for rapid, online, user/session-specific learning)
+- Choice of character-level **or** subword/BPE tokenization (configurable)
+- Sinusoidal positional encoding
+- Multi-head self-attention
+- Configurable depth, embedding size, sequence length, and attention heads
+- Simple end-to-end pipeline: preprocessing, training, and text generation
+- Modular, readable code ideal for educational use and tinkering
+- Temperature, top-k/top-p, and multinomial sampling in text generation
+
+---
+
+## What’s Unique: Stacked Adapters for Dual-Memory Learning
+
+Microformer implements **two adapters in every transformer block**:
+
+- **Long-term adapter:**
+  Trained on your full corpus during batch training.
+  Stores stable, general “knowledge” (e.g., literary style, factual info).
+
+- **Session adapter:**
+  Starts blank and is trained *on the fly* during chat or interactive teaching.
+  Lets you rapidly “teach” new facts, styles, or user preferences without overwriting core knowledge.
+
+At inference, each block applies the long-term adapter and then the session adapter (each a residual bottleneck) on top of the core transformer output, giving the model both stable and flexible, session-specific memory, loosely analogous to a brain’s “core memory” and “temporal lobe”.
+
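+The sketch below condenses the relevant code from `models/model.py` (the full file appears later in this commit):
+
+```python
+# Bottleneck adapter with a residual connection: the input passes through
+# unchanged, plus a small learned correction.
+class Adapter(nn.Module):
+    def __init__(self, dim, bottleneck=32):
+        super().__init__()
+        self.down = nn.Linear(dim, bottleneck)
+        self.relu = nn.ReLU()
+        self.up = nn.Linear(bottleneck, dim)
+
+    def forward(self, x):
+        return x + self.up(self.relu(self.down(x)))
+
+# Inside TransformerBlock.forward, after attention and feed-forward:
+#     x = self.long_term_adapter(x)   # stable corpus knowledge
+#     x = self.session_adapter(x)     # fast per-session memory
+```
+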
+---
+
+## Project Structure
+
+```
+microformer/
+├── config.py                # Hyperparameters and model settings
+├── data/
+│   ├── corpus.txt           # Raw training text
+│   ├── train.pt             # Preprocessed training tensor (token IDs)
+│   ├── val.pt               # Validation tensor (token IDs)
+│   ├── vocab.json           # Vocabulary (char or subword, stoi/itos mapping)
+│   └── tokenizer.json       # (optional) BPE tokenizer file if using subwords
+├── models/
+│   └── model.py             # Transformer model definition (Microformer)
+├── scripts/
+│   ├── prepare_data.py      # Data preprocessing/tokenization
+│   ├── train.py             # Training script (trains long-term adapters)
+│   ├── generate.py          # Inference/generation + online learning (session adapters)
+│   ├── memory.py            # SQLite helpers for the prompt/response log
+│   └── tokenizer_setup.py   # BPE tokenizer training
+└── README.md
+```
+
+---
+
+## Quickstart
+
+1. **Prepare your corpus**
+
+   Place your text data in `data/corpus.txt`.
+
+2. **Choose your tokenizer:**
+
+   - **Character-level (default):**
+     No extra steps needed.
+
+   - **BPE/subword (recommended for rich or modern text):**
+     ```bash
+     python scripts/tokenizer_setup.py
+     ```
+     The corpus path (`data/corpus.txt`) and vocabulary size (5000) are currently set inside the script; it does not parse command-line flags.
+
+3. **Prepare the dataset**
+
+   ```bash
+   python scripts/prepare_data.py
+   ```
+
+4. **Train the model (long-term knowledge)**
+
+   ```bash
+   python scripts/train.py
+   ```
+   - This trains the **long-term adapters** (plus the output head); `freeze_except_adapters` keeps the remaining weights frozen.
+   - Session adapters remain untrained (blank) until chat time.
+
+5. **Generate text and teach interactively (session memory)**
+
+   ```bash
+   python scripts/generate.py
+   ```
+   - Loads your trained model.
+   - Prompts for a seed string and temperature.
+   - **Allows you to “teach” new facts on the fly!**
+   - New knowledge is stored in the session adapters; it does *not* overwrite long-term knowledge.
+
+---
+
+## Example Config (`config.py`)
+
+```python
+EMBED_DIM = 256     # Size of token embeddings
+NUM_HEADS = 8       # Number of attention heads
+NUM_LAYERS = 4      # Number of transformer blocks
+FF_DIM = 512        # Feedforward layer dimension
+MAX_SEQ_LEN = 256   # Maximum sequence length
+ADAPTER_DIM = 32    # Bottleneck size for both long-term and session adapters
+VOCAB_SIZE = 100    # Placeholder; overridden from the tokenizer/vocab
+```
+
+(The batch size is currently set in `scripts/train.py`, not in `config.py`.)
+
+---
+
+## Using the Dual-Memory System
+
+- **Long-term adapters:**
+  Learned during `train.py`; they persist between runs.
+
+- **Session adapters:**
+  Learned during interactive chat in `generate.py`; optionally resettable between users/sessions.
+
+- **Teach new facts by entering a prompt and providing your ideal answer** (see the sketch below).
+  The model will “remember” this during the session, even if it wasn’t present in the training corpus.
+
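+A minimal scripted version of the teaching step, using the helpers defined in `scripts/generate.py` (the concrete prompt and answer here are only illustrative):
+
+```python
+# Update only the session adapters (and output head) on one taught example.
+model.freeze_except_adapters(session_only=True, include_output=True)
+
+prompt = "Who is Buck?"
+ideal_answer = "Buck is the sled-dog hero of The Call of the Wild."
+loss = online_unsupervised_update(
+    model, tokenizer, prompt + " " + ideal_answer,
+    optimizer, criterion, device, max_len=MAX_SEQ_LEN,
+)
+
+# Later generations in this session may now reflect the taught fact.
+print(generate(prompt, length=50, temperature=0.7))
+```
+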
+---
+
+## Customization & Ideas
+
+- Use BPE/subword tokenization for more expressive modeling (recommended for non-trivial datasets)
+- Add more adapters or experiment with gating, e.g. blending adapter outputs by context (see the sketch after this list)
+- Combine with key-value retrieval or a buffer for truly persistent “user memory”
+- Visualize training with TensorBoard or wandb
+- Tinker with alternative attention or memory mechanisms
+
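+A minimal sketch of the gating idea (not implemented in this repo; `Adapter` is the class from `models/model.py`):
+
+```python
+import torch
+import torch.nn as nn
+from models.model import Adapter
+
+class GatedDualAdapter(nn.Module):
+    """Blend long-term and session adapter corrections with a learned gate."""
+    def __init__(self, dim, bottleneck=32):
+        super().__init__()
+        self.long_term = Adapter(dim, bottleneck)
+        self.session = Adapter(dim, bottleneck)
+        self.gate = nn.Linear(dim, 2)  # per-token mixing weights
+
+    def forward(self, x):
+        w = torch.softmax(self.gate(x), dim=-1)  # [batch, seq, 2]
+        delta_long = self.long_term(x) - x       # adapters are residual, so
+        delta_sess = self.session(x) - x         # subtract x to recover the deltas
+        return x + w[..., 0:1] * delta_long + w[..., 1:2] * delta_sess
+```
+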
+---
+
+## Requirements
+
+- Python 3.8+
+- [PyTorch](https://pytorch.org/)
+- [tokenizers](https://github.com/huggingface/tokenizers) (for BPE/subword)
+
+Install dependencies with:
+```bash
+pip install torch tokenizers
+```
+
+---
+
+## Credits
+
+- Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and [minGPT](https://github.com/karpathy/minGPT) by Andrej Karpathy
+- Adapter and continual-learning inspiration from recent NLP research ([Houlsby et al. 2019](https://arxiv.org/abs/1902.00751))
+- Built using concepts from the original [GPT-1 paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
+
+---
+
+## License
+
+MIT License – Use freely for learning and experimentation.
+
+---
+
+**Happy tinkering with dual-memory transformers!**
config.py
ADDED
@@ -0,0 +1,9 @@
+# Hyperparameters and config settings
+
+EMBED_DIM = 256      # Size of token embeddings
+NUM_HEADS = 8        # Number of attention heads
+NUM_LAYERS = 4       # Number of transformer blocks
+FF_DIM = 512         # Feedforward layer dimension
+MAX_SEQ_LEN = 256    # Maximum sequence length
+VOCAB_SIZE = 100     # Placeholder (will be overridden based on dataset)
+ADAPTER_DIM = 32     # Bottleneck size for the continual-learning adapters
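+
+# Example usage (hypothetical; mirrors scripts/train.py): values are imported
+# at module level, and VOCAB_SIZE is replaced at runtime:
+#   from config import *
+#   VOCAB_SIZE = len(stoi)  # real vocabulary size from data/vocab.json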
models/__init__.py
ADDED
File without changes
models/model.py
ADDED
@@ -0,0 +1,113 @@
+import math
+
+import torch
+import torch.nn as nn
+
+class PositionalEncoding(nn.Module):
+    """Fixed sinusoidal positional encoding (Vaswani et al. 2017)."""
+    def __init__(self, d_model, max_len=5000):
+        super().__init__()
+        pe = torch.zeros(max_len, d_model)
+        position = torch.arange(0, max_len).unsqueeze(1).float()
+        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        pe = pe.unsqueeze(0)
+        self.register_buffer('pe', pe)
+
+    def forward(self, x):
+        return x + self.pe[:, :x.size(1)]
+
+class MultiHeadSelfAttention(nn.Module):
+    def __init__(self, embed_dim, num_heads):
+        super().__init__()
+        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
+
+    def forward(self, x):
+        # Causal mask: each position may only attend to itself and earlier
+        # positions (required for a GPT-style, left-to-right language model).
+        seq_len = x.size(1)
+        causal_mask = torch.triu(
+            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
+        )
+        attn_output, _ = self.attn(x, x, x, attn_mask=causal_mask)
+        return attn_output
+
+class FeedForward(nn.Module):
+    def __init__(self, embed_dim, ff_dim):
+        super().__init__()
+        self.ff = nn.Sequential(
+            nn.Linear(embed_dim, ff_dim),
+            nn.ReLU(),
+            nn.Linear(ff_dim, embed_dim)
+        )
+
+    def forward(self, x):
+        return self.ff(x)
+
+# --- Adapter block (bottleneck MLP with a residual connection) ---
+class Adapter(nn.Module):
+    def __init__(self, dim, bottleneck=32):
+        super().__init__()
+        self.down = nn.Linear(dim, bottleneck)
+        self.relu = nn.ReLU()
+        self.up = nn.Linear(bottleneck, dim)
+
+    def forward(self, x):
+        return x + self.up(self.relu(self.down(x)))  # residual
+
+class TransformerBlock(nn.Module):
+    def __init__(self, embed_dim, num_heads, ff_dim,
+                 long_term_adapter_dim=None, session_adapter_dim=None):
+        super().__init__()
+        self.attn = MultiHeadSelfAttention(embed_dim, num_heads)
+        self.norm1 = nn.LayerNorm(embed_dim)
+        self.ff = FeedForward(embed_dim, ff_dim)
+        self.norm2 = nn.LayerNorm(embed_dim)
+        # Two adapters: one for long-term memory (rarely updated),
+        # one for session memory (updated online during chat)
+        self.long_term_adapter = Adapter(embed_dim, long_term_adapter_dim) if long_term_adapter_dim else None
+        self.session_adapter = Adapter(embed_dim, session_adapter_dim) if session_adapter_dim else None
+
+    def forward(self, x):
+        x = self.norm1(x + self.attn(x))
+        x = self.norm2(x + self.ff(x))
+        # Apply both adapters in sequence, if present
+        if self.long_term_adapter is not None:
+            x = self.long_term_adapter(x)
+        if self.session_adapter is not None:
+            x = self.session_adapter(x)
+        return x
+
+class Microformer(nn.Module):
+    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers, max_seq_len,
+                 long_term_adapter_dim=None, session_adapter_dim=None):
+        super().__init__()
+        self.embedding = nn.Embedding(vocab_size, embed_dim)
+        self.positional_encoding = PositionalEncoding(embed_dim, max_seq_len)
+        self.layers = nn.ModuleList([
+            TransformerBlock(
+                embed_dim, num_heads, ff_dim,
+                long_term_adapter_dim=long_term_adapter_dim,
+                session_adapter_dim=session_adapter_dim
+            )
+            for _ in range(num_layers)
+        ])
+        self.output = nn.Linear(embed_dim, vocab_size)
+
+    def forward(self, x):
+        x = self.embedding(x)
+        x = self.positional_encoding(x)
+        for layer in self.layers:
+            x = layer(x)
+        return self.output(x)
+
+    def freeze_except_adapters(self, session_only=True, include_output=True):
+        """Freeze all parameters, then unfreeze the session adapters (always),
+        the long-term adapters (when session_only=False), and optionally the
+        output head."""
+        for param in self.parameters():
+            param.requires_grad = False
+        for layer in self.layers:
+            if getattr(layer, 'session_adapter', None) is not None:
+                for param in layer.session_adapter.parameters():
+                    param.requires_grad = True
+            if not session_only and getattr(layer, 'long_term_adapter', None) is not None:
+                for param in layer.long_term_adapter.parameters():
+                    param.requires_grad = True
+        if include_output:
+            for param in self.output.parameters():
+                param.requires_grad = True
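+
+# Quick shape check (hypothetical usage, assuming the defaults in config.py):
+#   model = Microformer(100, 256, 8, 512, 4, 256,
+#                       long_term_adapter_dim=32, session_adapter_dim=32)
+#   x = torch.randint(0, 100, (2, 16))       # (batch, seq)
+#   assert model(x).shape == (2, 16, 100)    # (batch, seq, vocab)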
scripts/generate.py
ADDED
@@ -0,0 +1,183 @@
+import sys
+from pathlib import Path
+sys.path.append(str(Path(__file__).resolve().parent.parent))
+
+import sqlite3
+from datetime import datetime
+
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from tokenizers import Tokenizer
+
+from models.model import Microformer
+from config import EMBED_DIM, NUM_HEADS, FF_DIM, NUM_LAYERS, MAX_SEQ_LEN, ADAPTER_DIM
+
+# --- Load tokenizer and model ---
+tokenizer = Tokenizer.from_file("data/tokenizer.json")
+VOCAB_SIZE = tokenizer.get_vocab_size()
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+model = Microformer(
+    vocab_size=VOCAB_SIZE,
+    embed_dim=EMBED_DIM,
+    num_heads=NUM_HEADS,
+    ff_dim=FF_DIM,
+    num_layers=NUM_LAYERS,
+    max_seq_len=MAX_SEQ_LEN,
+    long_term_adapter_dim=ADAPTER_DIM,
+    session_adapter_dim=ADAPTER_DIM
+)
+# map_location lets a checkpoint trained on GPU load on a CPU-only machine
+model.load_state_dict(torch.load("microformer.pt", map_location=device))
+model.to(device)
+model.eval()
+
+# --- Freeze all but session adapters and output for online learning ---
+model.freeze_except_adapters(session_only=True, include_output=True)
+
+# Ignore <PAD> positions so padded online updates don't train on padding
+criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.token_to_id("<PAD>"))
+optimizer = optim.Adam(
+    filter(lambda p: p.requires_grad, model.parameters()),
+    lr=1e-2  # high LR so teaching has a visible effect within a session
+)
+
+# --- Memory DB setup ---
+conn = sqlite3.connect("memory.db")
+c = conn.cursor()
+c.execute("""
+CREATE TABLE IF NOT EXISTS memory (
+    timestamp TEXT,
+    prompt TEXT,
+    response TEXT
+)
+""")
+conn.commit()
+
+def top_k_top_p_filtering(logits, top_k=50, top_p=0.9):
+    """Sample one token id after applying top-k and nucleus (top-p) filtering."""
+    logits = logits.squeeze(0)  # [1, vocab] -> [vocab]
+    probs = torch.softmax(logits, dim=-1)
+
+    # Sort probabilities in descending order
+    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
+    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
+
+    # Top-p mask: drop tokens once cumulative probability exceeds top_p
+    sorted_mask = cumulative_probs > top_p
+    sorted_mask[1:] = sorted_mask[:-1].clone()  # shift so the boundary token stays
+    sorted_mask[0] = False
+
+    # Top-k mask: keep at most top_k tokens
+    if top_k < sorted_probs.size(0):
+        sorted_mask[top_k:] = True
+
+    # Zero out masked values, renormalize, and sample
+    sorted_probs[sorted_mask] = 0.0
+    sorted_probs /= sorted_probs.sum()
+    sampled_relative_index = torch.multinomial(sorted_probs, 1).item()
+    return sorted_indices[sampled_relative_index].item()
+
+def generate(prompt, length=100, temperature=1.0, top_p=0.9, top_k=50):
+    input_ids = tokenizer.encode(prompt).ids
+    input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
+
+    eos_token_id = tokenizer.token_to_id("<EOS>")
+
+    for _ in range(length):
+        with torch.no_grad():
+            logits = model(input_tensor)
+            logits = logits[:, -1, :] / temperature
+
+        # Repetition penalty: dampen already-generated tokens. Dividing
+        # positive logits and multiplying negative ones lowers the probability
+        # in both cases (a plain multiply would *raise* the probability of
+        # tokens with negative logits).
+        for token_id in set(input_tensor[0].tolist()):
+            if logits[0, token_id] > 0:
+                logits[0, token_id] /= 1.25
+            else:
+                logits[0, token_id] *= 1.25
+
+        next_token_id = top_k_top_p_filtering(logits, top_k=top_k, top_p=top_p)
+        input_tensor = torch.cat([input_tensor, torch.tensor([[next_token_id]], device=device)], dim=1)
+
+        if next_token_id == eos_token_id:
+            break
+
+    decoded = tokenizer.decode(input_tensor[0].tolist())
+    if "<EOS>" in decoded:
+        decoded = decoded.split("<EOS>")[0].strip()
+    return decoded
+
+def online_unsupervised_update(model, tokenizer, text, optimizer, loss_fn, device, max_len=64):
+    """One next-token-prediction step on `text`. Call
+    freeze_except_adapters(session_only=True) first so only the session
+    adapters and output head are updated."""
+    ids = tokenizer.encode(text).ids + [tokenizer.token_to_id("<EOS>")]
+    if len(ids) < 2:
+        return None  # not enough tokens to form an (input, target) pair
+
+    ids = ids[:max_len + 1]
+    input_ids = ids[:-1]
+    target_ids = ids[1:]
+    pad_id = tokenizer.token_to_id("<PAD>")
+    input_ids += [pad_id] * (max_len - len(input_ids))
+    target_ids += [pad_id] * (max_len - len(target_ids))
+    input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
+    target_tensor = torch.tensor([target_ids], dtype=torch.long, device=device)
+
+    model.train()
+    logits = model(input_tensor)
+    loss = loss_fn(logits.view(-1, logits.size(-1)), target_tensor.view(-1))
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
+    model.eval()
+    return loss.item()
+
+# Optional: reset session adapter weights between sessions
+def reset_session_adapters(model):
+    for layer in model.layers:
+        adapter = getattr(layer, 'session_adapter', None)
+        if adapter is not None:
+            # Re-randomize the down projection but zero the up projection: the
+            # adapter starts as an identity map yet remains trainable (zeroing
+            # *all* weights would also kill its gradients).
+            adapter.down.reset_parameters()
+            nn.init.zeros_(adapter.up.weight)
+            nn.init.zeros_(adapter.up.bias)
+
+if __name__ == "__main__":
+    while True:
+        prompt = input("\nEnter a prompt (or 'exit' to quit): ")
+        if prompt.lower() in {"exit", "quit"}:
+            break
+        temp = float(input("Temperature (e.g. 0.7, 1.0): "))
+
+        output = generate(prompt, length=100, temperature=temp, top_p=0.9, top_k=50)
+        print("\nGenerated text:\n")
+        print(output)
+
+        # Online learning: always update session adapters only
+        teach = input("\nDo you want to teach the model a better answer? (y/N): ").strip().lower()
+        if teach == "y":
+            your_answer = input("Type your ideal response for this prompt: ")
+            model.freeze_except_adapters(session_only=True, include_output=True)
+            online_text = prompt + " " + your_answer
+            loss = online_unsupervised_update(
+                model, tokenizer, online_text, optimizer, criterion, device, max_len=MAX_SEQ_LEN
+            )
+            print(f"[Online update loss: {loss:.4f}]")
+        else:
+            model.freeze_except_adapters(session_only=True, include_output=True)
+            online_text = prompt + " " + output
+            loss = online_unsupervised_update(
+                model, tokenizer, online_text, optimizer, criterion, device, max_len=MAX_SEQ_LEN
+            )
+            print(f"[Online (self-improve) update loss: {loss:.4f}]")
+
+        # Store the interaction in the memory DB
+        c.execute("INSERT INTO memory (timestamp, prompt, response) VALUES (?, ?, ?)",
+                  (datetime.now().isoformat(timespec='seconds'), prompt, output))
+        conn.commit()
+
+        print("\nRecent memory:")
+        for row in c.execute("SELECT * FROM memory ORDER BY timestamp DESC LIMIT 5"):
+            print(f"[{row[0]}] {row[1]} → {row[2]}")
+
+        # Optional: uncomment to reset fast memory (session adapters) between users/sessions
+        # reset_session_adapters(model)
scripts/memory.py
ADDED
@@ -0,0 +1,42 @@
+import sqlite3
+
+# Connect to the SQLite database (created if it doesn't exist)
+conn = sqlite3.connect("memory.db")
+cursor = conn.cursor()
+
+# Create memory table if it doesn't exist
+cursor.execute("""
+CREATE TABLE IF NOT EXISTS memory (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    prompt TEXT NOT NULL,
+    response TEXT NOT NULL,
+    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
+)
+""")
+conn.commit()
+
+def save_memory(prompt: str, response: str):
+    """Save a prompt-response pair to the memory database."""
+    cursor.execute(
+        "INSERT INTO memory (prompt, response) VALUES (?, ?)",
+        (prompt, response)
+    )
+    conn.commit()
+
+def recall_memories(limit: int = 5):
+    """Retrieve the most recent prompt-response pairs."""
+    cursor.execute(
+        "SELECT prompt, response, timestamp FROM memory ORDER BY timestamp DESC LIMIT ?",
+        (limit,)
+    )
+    return cursor.fetchall()
+
+def clear_memory():
+    """Delete all memory records."""
+    cursor.execute("DELETE FROM memory")
+    conn.commit()
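+
+# Example usage (hypothetical):
+#   save_memory("Who is Buck?", "Buck is the sled-dog hero of The Call of the Wild.")
+#   for prompt, response, ts in recall_memories(3):
+#       print(ts, prompt, "->", response)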
scripts/prepare_data.py
ADDED
@@ -0,0 +1,26 @@
+import torch
+from tokenizers import Tokenizer
+
+# Load tokenizer
+tokenizer = Tokenizer.from_file("data/tokenizer.json")
+VOCAB_SIZE = tokenizer.get_vocab_size()
+
+# Load corpus
+with open("data/corpus.txt", "r", encoding="utf-8") as f:
+    text = f.read()
+
+# Encode with the BPE tokenizer
+encoded = tokenizer.encode(text).ids
+
+# Convert to tensor and split into train/val (90/10)
+data = torch.tensor(encoded, dtype=torch.long)
+split = int(0.9 * len(data))
+train_data = data[:split]
+val_data = data[split:]
+
+# Save outputs
+torch.save(train_data, "data/train.pt")
+torch.save(val_data, "data/val.pt")
scripts/tokenizer_setup.py
ADDED
@@ -0,0 +1,33 @@
+import json
+from pathlib import Path
+
+from tokenizers import Tokenizer, models, trainers, pre_tokenizers
+
+# Paths
+corpus_path = Path("data/corpus.txt")
+tokenizer_path = Path("data/tokenizer.json")
+
+# Read corpus
+with corpus_path.open("r", encoding="utf-8") as f:
+    lines = [line.strip() for line in f if line.strip()]
+
+# Initialize tokenizer with a BPE model
+tokenizer = Tokenizer(models.BPE())
+tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
+    pre_tokenizers.Whitespace(),
+    pre_tokenizers.Punctuation()
+])
+
+# Train tokenizer
+trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["<PAD>", "<UNK>", "<EOS>"])
+tokenizer.train_from_iterator(lines, trainer)
+
+# Save tokenizer
+tokenizer.save(str(tokenizer_path))
+
+# Create vocab.json for compatibility with the char-level pipeline
+vocab = tokenizer.get_vocab()
+stoi = vocab                               # token -> id
+itos = {v: k for k, v in vocab.items()}    # id -> token
+
+with open("data/vocab.json", "w") as f:
+    json.dump({"stoi": stoi, "itos": itos}, f)
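+
+# The saved tokenizer is loaded by the other scripts with:
+#   from tokenizers import Tokenizer
+#   tokenizer = Tokenizer.from_file("data/tokenizer.json")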
scripts/train.py
ADDED
@@ -0,0 +1,133 @@
+import sys
+from pathlib import Path
+sys.path.append(str(Path(__file__).resolve().parent.parent))
+
+import json
+
+import torch
+import torch.nn as nn
+import torch.optim as optim
+
+from models.model import Microformer
+from config import *
+
+# ------------------------
+# LOAD DATA AND VOCAB
+# ------------------------
+with open("data/vocab.json", "r") as f:
+    vocab = json.load(f)
+stoi = vocab["stoi"]
+itos = {int(k): v for k, v in vocab["itos"].items()}
+VOCAB_SIZE = len(stoi)
+
+data = torch.load("data/train.pt")
+SEQ_LEN = MAX_SEQ_LEN
+BATCH_SIZE = 32
+
+# Drop the remainder so batches have a clean shape
+num_batches = len(data) // (SEQ_LEN * BATCH_SIZE)
+trimmed_len = num_batches * SEQ_LEN * BATCH_SIZE
+data = data[:trimmed_len]
+data = data.view(BATCH_SIZE, -1)  # shape: (BATCH_SIZE, n_chunks)
+
+def get_batch(start_idx):
+    x = data[:, start_idx:start_idx + SEQ_LEN]
+    y = data[:, start_idx + 1:start_idx + SEQ_LEN + 1]  # targets are inputs shifted by one
+    return x, y
+
+# ------------------------
+# DEVICE SETUP
+# ------------------------
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# ------------------------
+# MODEL INSTANTIATION (with stacked adapters)
+# ------------------------
+model = Microformer(
+    VOCAB_SIZE,
+    EMBED_DIM,
+    NUM_HEADS,
+    FF_DIM,
+    NUM_LAYERS,
+    MAX_SEQ_LEN,
+    long_term_adapter_dim=ADAPTER_DIM,  # set in config
+    session_adapter_dim=ADAPTER_DIM     # set in config
+)
+model.to(device)
+
+# ------------------------
+# TRAIN LONG-TERM ADAPTERS ONLY
+# ------------------------
+model.freeze_except_adapters(session_only=False, include_output=True)
+# freeze_except_adapters always unfreezes the session adapters, so they must
+# be re-frozen here to stay blank during corpus training:
+for layer in model.layers:
+    if getattr(layer, 'session_adapter', None) is not None:
+        for param in layer.session_adapter.parameters():
+            param.requires_grad = False
+
+criterion = nn.CrossEntropyLoss()
+optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
+
+# ------------------------
+# MAIN BATCH TRAINING LOOP (CORPUS)
+# ------------------------
+for epoch in range(6):
+    for i in range(0, data.shape[1] - SEQ_LEN, SEQ_LEN):
+        inputs, targets = get_batch(i)
+        inputs, targets = inputs.to(device), targets.to(device)
+        optimizer.zero_grad()
+        out = model(inputs)
+        loss = criterion(out.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
+        loss.backward()
+        optimizer.step()
+
+    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
+
+torch.save(model.state_dict(), "microformer.pt")
+
+# ------------------------
+# ONLINE (SESSION) LEARNING UTILITY
+# ------------------------
+def online_unsupervised_update(model, tokenizer, text, optimizer, loss_fn, device, max_len=64):
+    # Only updates the session adapters/output head; call
+    # freeze_except_adapters(session_only=True) before this as needed.
+    # (Consider ignore_index=<PAD> in loss_fn so padding isn't trained on.)
+    ids = tokenizer.encode(text).ids + [tokenizer.token_to_id("<EOS>")]
+    if len(ids) < 2:
+        return None  # not enough tokens
+
+    ids = ids[:max_len + 1]
+    input_ids = ids[:-1]
+    target_ids = ids[1:]
+    input_ids += [tokenizer.token_to_id("<PAD>")] * (max_len - len(input_ids))
+    target_ids += [tokenizer.token_to_id("<PAD>")] * (max_len - len(target_ids))
+    input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)
+    target_tensor = torch.tensor([target_ids], dtype=torch.long, device=device)
+
+    model.train()
+    logits = model(input_tensor)
+    loss = loss_fn(logits.view(-1, logits.size(-1)), target_tensor.view(-1))
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
+    model.eval()
+    return loss.item()
+
+# ------------------------
+# SESSION ADAPTER RESET FUNCTION (OPTIONAL)
+# ------------------------
+def reset_session_adapters(model):
+    for layer in model.layers:
+        adapter = getattr(layer, 'session_adapter', None)
+        if adapter is not None:
+            # Re-randomize the down projection and zero the up projection so
+            # the adapter starts as an identity map but remains trainable.
+            adapter.down.reset_parameters()
+            nn.init.zeros_(adapter.up.weight)
+            nn.init.zeros_(adapter.up.bias)
+
+# ------------------------
+# USAGE FOR ONLINE LEARNING (after chat, NOT in the main batch loop):
+# ------------------------
+# from tokenizers import Tokenizer
+# tokenizer = Tokenizer.from_file("data/tokenizer.json")
+# model.freeze_except_adapters(session_only=True, include_output=True)
+# message = "Who is Buck?"
+# loss = online_unsupervised_update(model, tokenizer, message, optimizer, criterion, device, max_len=SEQ_LEN)
+# print(f"Online update loss: {loss}")