Architecture code included.
- .gitattributes +1 -0
- architecture/README.md +82 -0
- architecture/__init__.py +2 -0
- architecture/architecture.png +3 -0
- architecture/gemma3.py +130 -0
- architecture/model_config.py +16 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+architecture/architecture.png filter=lfs diff=lfs merge=lfs -text
architecture/README.md
ADDED
@@ -0,0 +1,82 @@
# Architecture Module

This module contains the main Gemma3 model implementation and configuration management.

## Files

### `gemma3.py`
The core Gemma3Model class implementation featuring:

- **Token Embeddings**: Scaled embedding layer with a vocabulary size of 50,257
- **Transformer Blocks**: 18 layers with mixed attention patterns (sliding window and full attention)
- **Dual RoPE**: Two sets of rotary position embeddings for local and global context
- **Attention Masks**: Dynamic generation of causal and sliding window masks
- **Output Head**: Linear projection to vocabulary size for next-token prediction
- **Generation Method**: Temperature-controlled sampling with top-k filtering

Key components:
- `__init__`: Initializes model layers, embeddings, and precomputes RoPE parameters
- `_create_masks`: Generates causal and sliding window attention masks
- `forward`: Main forward pass with optional loss computation
- `generate`: Autoregressive text generation with temperature and top-k sampling

### `model_config.py`
Configuration loader that reads model hyperparameters from `config/model_config.json`.

### `__init__.py`
Module initialization that exports:
- `model_config`: Dictionary containing all model hyperparameters
- `Gemma3Model`: The main model class

## Model Architecture Details

### Layer Configuration
The model uses a strategic mix of attention types across 18 layers (a configuration sketch follows this list):
- **Layers 1-5**: Sliding window attention (512-token window)
- **Layer 6**: Full attention (checkpoint layer)
- **Layers 7-11**: Sliding window attention
- **Layer 12**: Full attention (checkpoint layer)
- **Layers 13-17**: Sliding window attention
- **Layer 18**: Full attention (final layer)

This pattern allows the model to:
- Efficiently process local context with sliding windows
- Capture long-range dependencies at strategic checkpoints
- Balance computational efficiency with modeling capability

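For illustration, the per-layer attention pattern above is driven by the `layer_types` list that `Gemma3Model` reads from the configuration. Below is a minimal sketch of the relevant keys as they would appear in the loaded `model_config` dictionary; the key names come from `gemma3.py`, but the attention-type string values are assumptions, since `TransformerBlock` is not part of this commit:

```python
# Illustrative excerpt only -- not the actual contents of config/model_config.json.
model_config_excerpt = {
    "n_layers": 18,
    "sliding_window": 512,
    "layer_types": [
        # layers 1-5: sliding window, layer 6: full attention
        "sliding_attention", "sliding_attention", "sliding_attention",
        "sliding_attention", "sliding_attention", "full_attention",
        # layers 7-11: sliding window, layer 12: full attention
        "sliding_attention", "sliding_attention", "sliding_attention",
        "sliding_attention", "sliding_attention", "full_attention",
        # layers 13-17: sliding window, layer 18: full attention
        "sliding_attention", "sliding_attention", "sliding_attention",
        "sliding_attention", "sliding_attention", "full_attention",
    ],
}
```
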
### Embedding and Normalization
- **Embedding Scaling**: Input embeddings are scaled by √(embedding_dim) for training stability
- **Final Normalization**: RMS normalization before the output projection
- **No Weight Tying**: The output projection uses its own weight matrix, separate from the input embeddings

### Position Encoding
The model uses dual RoPE (Rotary Position Embeddings):
- **Local RoPE**: θ_base = 10,000 for sliding window attention
- **Global RoPE**: θ_base = 1,000,000 for full attention layers

This dual approach allows different attention patterns to use position encodings optimized for their respective context ranges.

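The cosine/sine tables for both RoPE variants are precomputed by `compute_rope_params` from `block/rope.py`, which is not included in this commit. For reference, here is a minimal sketch of what such a helper typically computes under the standard RoPE formulation; the signature mirrors the call in `gemma3.py`, but the body is illustrative rather than the repository's implementation:

```python
import torch

def compute_rope_params(head_dim, theta_base, context_length, dtype=torch.float32):
    # Inverse frequencies for each pair of head dimensions.
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2, dtype=dtype) / head_dim))
    # Rotation angle for every (position, frequency) pair.
    positions = torch.arange(context_length, dtype=dtype)
    angles = positions[:, None] * inv_freq[None, :]   # (context_length, head_dim // 2)
    angles = torch.cat([angles, angles], dim=-1)      # (context_length, head_dim)
    return torch.cos(angles), torch.sin(angles)
```
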
## Usage Example

```python
from architecture import Gemma3Model, model_config
import torch

# Initialize model
model = Gemma3Model(model_config)

# Forward pass
input_ids = torch.randint(0, 50257, (2, 128))  # batch_size=2, seq_len=128
logits, loss = model(input_ids, targets=None)

# Generation
prompt = torch.randint(0, 50257, (1, 10))  # Single prompt
generated = model.generate(prompt, max_new_tokens=50, temperature=0.8, top_k=40)
```

## Design Decisions

1. **Mixed Attention**: Combines the efficiency of sliding windows with the modeling power of full attention
2. **Separate RoPE Bases**: Optimizes position encoding for different attention ranges
3. **Grouped Query Attention**: Reduces KV cache memory while maintaining performance
4. **Gemma3-style Normalization**: Uses (1 + weight) scaling for better training dynamics (see the sketch after this list)
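
The `(1 + weight)` scaling in point 4 refers to the Gemma-style RMSNorm variant implemented in `block/rms_norm.py`, which is not part of this commit. A minimal sketch of that formulation, under the assumption that the repository follows the common Gemma convention of a zero-initialized scale parameter (the class name here is illustrative, not the repository's):

```python
import torch
import torch.nn as nn

class GemmaStyleRMSNorm(nn.Module):
    """Illustrative sketch of RMSNorm with (1 + weight) scaling."""

    def __init__(self, emb_dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        # Zero-initialized so (1 + weight) starts as identity scaling.
        self.weight = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then apply (1 + weight).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x * rms) * (1.0 + self.weight)
```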
architecture/__init__.py
ADDED
@@ -0,0 +1,2 @@
from .gemma3 import Gemma3Model
from .model_config import model_config
architecture/architecture.png
ADDED
(architecture diagram image; stored with Git LFS)
architecture/gemma3.py
ADDED
@@ -0,0 +1,130 @@
import os, sys
from os.path import dirname as up

sys.path.append(os.path.abspath(os.path.join(up(__file__), os.pardir)))

import torch
import torch.nn as nn
import torch.nn.functional as F

from block.transformer import TransformerBlock
from block.rms_norm import RMSNorm
from block.rope import compute_rope_params

class Gemma3Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg["layer_types"] is not None and len(cfg["layer_types"]) == cfg["n_layers"]

        # Main model parameters
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])

        self.blocks = nn.ModuleList([
            TransformerBlock(cfg, attn_type) for attn_type in cfg["layer_types"]
        ])

        self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-6)
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
        self.cfg = cfg

        # Reusable utilities: precompute RoPE tables for local and global attention
        cos_local, sin_local = compute_rope_params(
            head_dim=cfg["head_dim"],
            theta_base=cfg["rope_local_base"],
            context_length=cfg["context_length"],
            dtype=torch.float32,
        )
        cos_global, sin_global = compute_rope_params(
            head_dim=cfg["head_dim"],
            theta_base=cfg["rope_base"],
            context_length=cfg["context_length"],
            dtype=torch.float32,
        )
        self.register_buffer("cos_local", cos_local, persistent=False)
        self.register_buffer("sin_local", sin_local, persistent=False)
        self.register_buffer("cos_global", cos_global, persistent=False)
        self.register_buffer("sin_global", sin_global, persistent=False)

    def _create_masks(self, seq_len, device):
        ones = torch.ones((seq_len, seq_len), dtype=torch.bool, device=device)

        # mask_global (future is masked: j > i)
        #     j:  0 1 2 3 4 5 6 7
        #  i
        #  0:     0 1 1 1 1 1 1 1
        #  1:     0 0 1 1 1 1 1 1
        #  2:     0 0 0 1 1 1 1 1
        #  3:     0 0 0 0 1 1 1 1
        #  4:     0 0 0 0 0 1 1 1
        #  5:     0 0 0 0 0 0 1 1
        #  6:     0 0 0 0 0 0 0 1
        #  7:     0 0 0 0 0 0 0 0
        mask_global = torch.triu(ones, diagonal=1)

        # far_past (too far back is masked: i - j >= sliding_window)
        # where sliding_window = 4
        #     j:  0 1 2 3 4 5 6 7
        #  i
        #  0:     0 0 0 0 0 0 0 0
        #  1:     0 0 0 0 0 0 0 0
        #  2:     0 0 0 0 0 0 0 0
        #  3:     0 0 0 0 0 0 0 0
        #  4:     1 0 0 0 0 0 0 0
        #  5:     1 1 0 0 0 0 0 0
        #  6:     1 1 1 0 0 0 0 0
        #  7:     1 1 1 1 0 0 0 0
        far_past = torch.triu(ones, diagonal=self.cfg["sliding_window"]).T

        # Local (sliding_window) = future OR far-past
        # mask_local
        #     j:  0 1 2 3 4 5 6 7
        #  i
        #  0:     0 1 1 1 1 1 1 1
        #  1:     0 0 1 1 1 1 1 1
        #  2:     0 0 0 1 1 1 1 1
        #  3:     0 0 0 0 1 1 1 1
        #  4:     1 0 0 0 0 1 1 1
        #  5:     1 1 0 0 0 0 1 1
        #  6:     1 1 1 0 0 0 0 1
        #  7:     1 1 1 1 0 0 0 0
        mask_local = mask_global | far_past
        return mask_global, mask_local

    def forward(self, input_ids, targets=None):
        b, seq_len = input_ids.shape
        # Scale token embeddings by sqrt(emb_dim) for training stability
        x = self.tok_emb(input_ids) * (self.cfg["emb_dim"] ** 0.5)
        mask_global, mask_local = self._create_masks(seq_len, x.device)

        for block in self.blocks:
            x = block(
                x,
                mask_global=mask_global,
                mask_local=mask_local,
                cos_global=self.cos_global,
                sin_global=self.sin_global,
                cos_local=self.cos_local,
                sin_local=self.sin_local,
            )

        x = self.final_norm(x)
        logits = self.out_head(x.to(self.cfg["dtype"]))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop the running sequence to the model's maximum context length
            ctx_len = self.cfg["context_length"]
            idx_cond = idx if idx.size(1) <= ctx_len else idx[:, -ctx_len:]
            logits, _ = self(idx_cond)  # targets=None by default
            # Take the last time step and apply temperature scaling
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                # Keep only the top_k logits; mask the rest to -inf before sampling
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float("-inf")
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
architecture/model_config.py
ADDED
@@ -0,0 +1,16 @@
import os, sys
from os.path import dirname as up

sys.path.append(os.path.abspath(os.path.join(up(__file__), os.pardir)))

import json

MODEL_CONFIG_PATH = 'config/model_config.json'

with open(MODEL_CONFIG_PATH, 'r') as f:
    model_config = json.load(f)

# print(model_config)
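
Note that `json.load` yields plain Python values, while `Gemma3Model` passes `cfg["dtype"]` directly to `nn.Embedding` and `nn.Linear`, which expect a `torch.dtype`. How the repository bridges this is not shown in this commit; below is a minimal sketch of one possible post-processing step, where the string-to-dtype mapping and the helper name are purely illustrative assumptions:

```python
import torch

# Hypothetical helper: convert a string "dtype" entry from the JSON config
# into the torch.dtype object that Gemma3Model expects.
_DTYPE_MAP = {"float32": torch.float32, "float16": torch.float16, "bfloat16": torch.bfloat16}

def resolve_dtype(config: dict) -> dict:
    dtype = config.get("dtype")
    if isinstance(dtype, str):
        config["dtype"] = _DTYPE_MAP[dtype]
    return config
```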