# Leap-0

This repository contains the implementation of **Leap-0**, a lightweight, modified version of the GPT architecture trained from scratch on FineWeb-Edu, an open-source dataset. The project demonstrates the design, training, and optimization of a custom language model on local hardware.

Figure 1: Architecture of Leap

## Features

- **Custom GPT Architecture**: A miniaturized version of the GPT model, tailored for efficient training on limited hardware.
- **Local Training**: All model training is executed on local resources, enabling cost-effective development.
- **Open-Source Dataset**: Trained on the publicly available FineWeb-Edu dataset to ensure accessibility and reproducibility.
- **Scalable Design**: Architecture optimized for experimentation and scalability while maintaining resource efficiency.

## Implementation Details

1. **Model Architecture**
   - A streamlined GPT-based architecture designed for reduced complexity and improved training efficiency.
   - Incorporates modifications to parameter scaling to suit resource-constrained environments.
2. **Training**
   - Training executed locally on an NVIDIA RTX 4500 Ada Generation GPU (24 GB), using PyTorch (see the training sketch at the end of this README).
3. **Testing**
   - A simple Streamlit UI for testing the model's generation capability (see the sketch at the end of this README).

## Model Architecture

### Configuration

- **Sequence Length:** 512 tokens
- **Vocabulary Size:** 48,951 tokens
  - Includes 50,000 BPE merges, 256 byte-level tokens, and 1 `<|endoftext|>` token.
- **Number of Layers:** 4 transformer blocks
- **Attention Heads:** 8 per block
- **Embedding Dimension:** 512
- **Dropout:** 0.1

### Components

1. **Embeddings**
   - **Word Embeddings (`wte`):** Learnable token embeddings of size `n_embd`.
   - **Position Embeddings (`wpe`):** Learnable positional embeddings for sequences up to `block_size`.
2. **Transformer Blocks**
   - A stack of 4 transformer blocks, each comprising:
     - Multi-head self-attention.
     - A feedforward network for feature transformation.
3. **Output Head**
   - **Linear Layer (`lm_head`):** Maps hidden states to logits for token prediction.
   - Shares weights between the token embeddings (`wte`) and the output projection for parameter efficiency.
4. **Layer Normalization**
   - A final layer normalization (`ln_f`) ensures stable optimization.

An illustrative sketch of this architecture is included at the end of this README.

## Current Status

1. Dataset: FineWeb-Edu (18.5 GB), used in its entirety.
2. Training steps: 5,000
3. Time taken: ~7 hours
4. Checkpoint format: `.pt`

## Requirements

- Python 3.8+
- PyTorch 2.0+ or TensorFlow 2.10+
- CUDA-enabled GPU with at least 4 GB VRAM (recommended)
- Dependencies listed in `requirements.txt`
- **Note**: Different operating systems support different PyTorch/TensorFlow versions for CUDA (local GPU) acceleration. Install only after verifying compatibility with your OS.
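## Architecture Sketch

For reference, the sketch below shows how the configuration and components listed above fit together in PyTorch. It is a minimal illustration rather than the repository's actual code: the class names (`LeapConfig`, `Block`, `LeapGPT`), the use of `nn.MultiheadAttention`, and the 4x MLP width are assumptions; only the hyperparameters and the component names (`wte`, `wpe`, `ln_f`, `lm_head`) come from the sections above.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class LeapConfig:
    block_size: int = 512     # sequence length
    vocab_size: int = 48951   # tokenizer vocabulary size
    n_layer: int = 4          # transformer blocks
    n_head: int = 8           # attention heads per block
    n_embd: int = 512         # embedding dimension
    dropout: float = 0.1


class Block(nn.Module):
    """One transformer block: multi-head self-attention + feedforward network."""

    def __init__(self, cfg: LeapConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(cfg.n_embd)
        self.attn = nn.MultiheadAttention(
            cfg.n_embd, cfg.n_head, dropout=cfg.dropout, batch_first=True
        )
        self.ln_2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x


class LeapGPT(nn.Module):
    """Minimal GPT: token + position embeddings, 4 blocks, final LayerNorm, tied LM head."""

    def __init__(self, cfg: LeapConfig):
        super().__init__()
        self.cfg = cfg
        self.wte = nn.Embedding(cfg.vocab_size, cfg.n_embd)   # word embeddings
        self.wpe = nn.Embedding(cfg.block_size, cfg.n_embd)   # position embeddings
        self.drop = nn.Dropout(cfg.dropout)
        self.blocks = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)                  # final layer norm
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                 # weight sharing with wte

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.wte(idx) + self.wpe(pos))
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```

With these settings, most of the parameter budget sits in the tied embedding/output matrix (48,951 × 512), which is why weight sharing matters at this scale.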
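## Training Loop Sketch

The training run described above (5,000 steps on a single local GPU, saved as a `.pt` checkpoint) could be driven by a loop along the following lines. This is an assumed outline, not the repository's script: the optimizer (AdamW), learning rate, batch size, and the placeholder `get_batch` loader are illustrative only; the step count, sequence length, and checkpoint format come from the README.

```python
# train.py -- hypothetical outline of the local training run.
import torch

from model import LeapConfig, LeapGPT  # the architecture sketch above, assumed saved as model.py

device = "cuda" if torch.cuda.is_available() else "cpu"
cfg = LeapConfig()
model = LeapGPT(cfg).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)


def get_batch(batch_size: int = 16):
    """Placeholder: yields random token ids; the real run would stream tokenized FineWeb-Edu."""
    x = torch.randint(cfg.vocab_size, (batch_size, cfg.block_size), device=device)
    y = torch.roll(x, shifts=-1, dims=1)  # next-token targets
    return x, y


for step in range(5000):
    x, y = get_batch()
    _, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss {loss.item():.4f}")

torch.save(model.state_dict(), "leap0.pt")  # checkpoint in .pt format
```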
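## Streamlit Test Sketch

The Testing step above mentions a simple Streamlit UI for trying out generation. A minimal harness might look like the following; the checkpoint path `leap0.pt`, the sampling loop, and the `encode`/`decode` helpers (imported from a hypothetical `tokenizer` module standing in for the project's actual tokenizer) are assumptions, not the repository's app.

```python
# app.py -- hypothetical Streamlit harness for sampling from a Leap-0 checkpoint.
import streamlit as st
import torch
import torch.nn.functional as F

from model import LeapConfig, LeapGPT      # the architecture sketch above
from tokenizer import encode, decode       # hypothetical: the project's tokenizer helpers


@st.cache_resource
def load_model(path: str = "leap0.pt") -> LeapGPT:
    model = LeapGPT(LeapConfig())
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model


@torch.no_grad()
def generate(model: LeapGPT, idx: torch.Tensor, max_new_tokens: int, temperature: float) -> torch.Tensor:
    """Autoregressive sampling: feed the running sequence back in, one token at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.cfg.block_size:]          # crop to the 512-token context
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature            # logits for the last position only
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat([idx, next_id], dim=1)
    return idx


st.title("Leap-0 playground")
prompt = st.text_area("Prompt", "Once upon a time")
max_new_tokens = st.slider("Max new tokens", 16, 256, 64)
temperature = st.slider("Temperature", 0.1, 2.0, 0.8)

if st.button("Generate"):
    model = load_model()
    idx = torch.tensor([encode(prompt)], dtype=torch.long)
    out = generate(model, idx, max_new_tokens, temperature)
    st.write(decode(out[0].tolist()))
```

Run with `streamlit run app.py` once a checkpoint and tokenizer are in place.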