# Leap-0

This repository contains the implementation of **Leap-0**, a lightweight, modified version of the GPT architecture trained from scratch on FineWeb-Edu, an open-source dataset. The project demonstrates the design, training, and optimization of a custom language model on local hardware.

Figure 1: Architecture of Leap

## Features

- **Custom GPT Architecture**: A miniaturized version of the GPT model, tailored for efficient training on limited hardware.
- **Local Training**: All model training is executed on local resources, enabling cost-effective development.
- **Open-Source Dataset**: Trained on the publicly available FineWeb-Edu dataset to ensure accessibility and reproducibility.
- **Scalable Design**: Architecture optimized for experimentation and scalability while maintaining resource efficiency.

## Implementation Details

1. **Model Architecture**
   - A streamlined GPT-based architecture designed for reduced complexity and improved training efficiency.
   - Incorporates modifications to parameter scaling to suit resource-constrained environments.
2. **Training**
   - Training executed locally on an NVIDIA RTX 4500 Ada Generation GPU (24 GB), using PyTorch (see the training sketch at the end of this README).
3. **Testing**
   - A simple Streamlit UI for testing the model's generation capability (see the sketch at the end of this README).

## Model Architecture

### Configuration

- **Sequence Length:** 512 tokens
- **Vocabulary Size:** 48,951 tokens
  - Includes 50,000 BPE merges, 256 byte-level tokens, and 1 `<|endoftext|>` token.
- **Number of Layers:** 4 transformer blocks
- **Attention Heads:** 8 per block
- **Embedding Dimension:** 512
- **Dropout:** 0.1

### Components

1. **Embeddings**
   - **Word Embeddings (`wte`):** Learnable token embeddings of size `n_embd`.
   - **Position Embeddings (`wpe`):** Learnable positional embeddings for sequences up to `block_size`.
2. **Transformer Blocks**
   - A stack of 4 transformer blocks, each comprising:
     - Multi-head self-attention.
     - A feedforward network for feature transformation.
3. **Output Head**
   - **Linear Layer (`lm_head`):** Maps hidden states to logits for token prediction.
   - Shares weights between the token embeddings (`wte`) and the output projection for parameter efficiency.
4. **Layer Normalization**
   - A final layer normalization (`ln_f`) ensures stable optimization.

An illustrative sketch of this architecture is included at the end of this README.

## Current Status

1. Dataset: FineWeb-Edu (18.5 GB), used in its entirety.
2. Training steps: 5,000
3. Time taken: ~7 hours
4. Checkpoint format: `.pt`

## Requirements

- Python 3.8+
- PyTorch 2.0+ or TensorFlow 2.10+
- CUDA-enabled GPU with at least 4 GB VRAM (recommended)
- Dependencies listed in `requirements.txt`
- **Note**: Different operating systems support different PyTorch/TensorFlow versions for CUDA (local GPU) acceleration. Install only after verifying compatibility with your OS.
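## Architecture Sketch

For reference, the sketch below shows how the configuration and components listed above fit together in PyTorch. It is a minimal illustration rather than the repository's actual code: the class names (`LeapConfig`, `Block`, `LeapGPT`), the use of `nn.MultiheadAttention`, and the 4x MLP width are assumptions; only the hyperparameters and the component names (`wte`, `wpe`, `ln_f`, `lm_head`) come from the sections above.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class LeapConfig:
    block_size: int = 512     # sequence length
    vocab_size: int = 48951   # tokenizer vocabulary size
    n_layer: int = 4          # transformer blocks
    n_head: int = 8           # attention heads per block
    n_embd: int = 512         # embedding dimension
    dropout: float = 0.1


class Block(nn.Module):
    """One transformer block: multi-head self-attention + feedforward network."""

    def __init__(self, cfg: LeapConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(cfg.n_embd)
        self.attn = nn.MultiheadAttention(
            cfg.n_embd, cfg.n_head, dropout=cfg.dropout, batch_first=True
        )
        self.ln_2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x


class LeapGPT(nn.Module):
    """Minimal GPT: token + position embeddings, 4 blocks, final LayerNorm, tied LM head."""

    def __init__(self, cfg: LeapConfig):
        super().__init__()
        self.cfg = cfg
        self.wte = nn.Embedding(cfg.vocab_size, cfg.n_embd)   # word embeddings
        self.wpe = nn.Embedding(cfg.block_size, cfg.n_embd)   # position embeddings
        self.drop = nn.Dropout(cfg.dropout)
        self.blocks = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)                  # final layer norm
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                 # weight sharing with wte

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.wte(idx) + self.wpe(pos))
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```

With these settings, most of the parameter budget sits in the tied embedding/output matrix (48,951 × 512), which is why weight sharing matters at this scale.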
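## Training Loop Sketch

The training run described above (5,000 steps on a single local GPU, saved as a `.pt` checkpoint) could be driven by a loop along the following lines. This is an assumed outline, not the repository's script: the optimizer (AdamW), learning rate, batch size, and the placeholder `get_batch` loader are illustrative only; the step count, sequence length, and checkpoint format come from the README.

```python
# train.py -- hypothetical outline of the local training run.
import torch

from model import LeapConfig, LeapGPT  # the architecture sketch above, assumed saved as model.py

device = "cuda" if torch.cuda.is_available() else "cpu"
cfg = LeapConfig()
model = LeapGPT(cfg).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)


def get_batch(batch_size: int = 16):
    """Placeholder: yields random token ids; the real run would stream tokenized FineWeb-Edu."""
    x = torch.randint(cfg.vocab_size, (batch_size, cfg.block_size), device=device)
    y = torch.roll(x, shifts=-1, dims=1)  # next-token targets
    return x, y


for step in range(5000):
    x, y = get_batch()
    _, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss {loss.item():.4f}")

torch.save(model.state_dict(), "leap0.pt")  # checkpoint in .pt format
```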
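## Streamlit Test Sketch

The Testing step above mentions a simple Streamlit UI for trying out generation. A minimal harness might look like the following; the checkpoint path `leap0.pt`, the sampling loop, and the `encode`/`decode` helpers (imported from a hypothetical `tokenizer` module standing in for the project's actual tokenizer) are assumptions, not the repository's app.

```python
# app.py -- hypothetical Streamlit harness for sampling from a Leap-0 checkpoint.
import streamlit as st
import torch
import torch.nn.functional as F

from model import LeapConfig, LeapGPT      # the architecture sketch above
from tokenizer import encode, decode       # hypothetical: the project's tokenizer helpers


@st.cache_resource
def load_model(path: str = "leap0.pt") -> LeapGPT:
    model = LeapGPT(LeapConfig())
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model


@torch.no_grad()
def generate(model: LeapGPT, idx: torch.Tensor, max_new_tokens: int, temperature: float) -> torch.Tensor:
    """Autoregressive sampling: feed the running sequence back in, one token at a time."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.cfg.block_size:]          # crop to the 512-token context
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature            # logits for the last position only
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat([idx, next_id], dim=1)
    return idx


st.title("Leap-0 playground")
prompt = st.text_area("Prompt", "Once upon a time")
max_new_tokens = st.slider("Max new tokens", 16, 256, 64)
temperature = st.slider("Temperature", 0.1, 2.0, 0.8)

if st.button("Generate"):
    model = load_model()
    idx = torch.tensor([encode(prompt)], dtype=torch.long)
    out = generate(model, idx, max_new_tokens, temperature)
    st.write(decode(out[0].tolist()))
```

Run with `streamlit run app.py` once a checkpoint and tokenizer are in place.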