
Leap-0

This repository contains the implementation of Leap-0, a lightweight, modified version of the GPT architecture trained from scratch on FineWeb-Edu, an open-source dataset. The project demonstrates the design, training, and optimization of a custom language model on local hardware.

Figure 1: Architecture of Leap

Features

  • Custom GPT Architecture: A miniaturized version of the GPT model tailored for efficient training on limited hardware.
  • Local Training: Complete model training executed on local resources, enabling cost-effective development.
  • Open-Source Dataset: Trained on the publicly available FineWeb-Edu dataset to ensure accessibility and reproducibility.
  • Scalable Design: Architecture optimized for experimentation and scalability while maintaining resource efficiency.

Implementation Details

  1. Model Architecture

    • A streamlined GPT-based architecture designed for reduced complexity and improved training efficiency.
    • Incorporates modifications to parameter scaling to suit resource-constrained environments.
  2. Training

    • Training executed locally on an NVIDIA RTX 4500 Ada (24 GB) GPU, using PyTorch.
  3. Testing

    • A simple Streamlit UI for testing the model's generation capability (a sketch follows this list).
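
The Streamlit app itself is not included in this card, so the following is only a minimal sketch of such a testing UI. It assumes the trained model was saved as a whole PyTorch object and exposes a text-in/text-out `generate` helper; the file name, checkpoint path, and `generate` signature are assumptions, not the repository's actual code.

```python
# app.py -- minimal sketch of a Streamlit testing UI (hypothetical file and paths).
import streamlit as st
import torch

@st.cache_resource
def load_model():
    # Assumes the checkpoint stores the full model object; adapt if only a
    # state_dict was saved.
    model = torch.load("leap0.pt", map_location="cpu")
    model.eval()
    return model

st.title("Leap-0 playground")
prompt = st.text_area("Prompt", "The history of education")
max_new_tokens = st.slider("Max new tokens", 16, 256, 64)

if st.button("Generate"):
    model = load_model()
    # `generate` is assumed to be a sampling helper that handles tokenization;
    # substitute the project's actual generation routine.
    with torch.no_grad():
        output = model.generate(prompt, max_new_tokens=max_new_tokens)
    st.write(output)
```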

Model Architecture

Configuration

  • Sequence Length: 512 tokens
  • Vocabulary Size: 48,951 tokens
    • Includes 50,000 BPE merges, 256 special byte tokens, and 1 <|endoftext|> token.
  • Number of Layers: 4 transformer blocks
  • Attention Heads: 8 per block
  • Embedding Dimension: 512
  • Dropout: 0.1
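
The configuration above maps naturally onto a GPT-style config object. The sketch below uses common field names (block_size, n_embd, etc.) as assumptions; the repository's actual configuration code may differ.

```python
from dataclasses import dataclass

@dataclass
class Leap0Config:
    block_size: int = 512      # maximum sequence length
    vocab_size: int = 48_951   # tokenizer vocabulary size
    n_layer: int = 4           # number of transformer blocks
    n_head: int = 8            # attention heads per block
    n_embd: int = 512          # embedding dimension
    dropout: float = 0.1
```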

Components

  1. Embeddings:

    • Word Embeddings (wte): Learnable token embeddings of size n_embd.
    • Position Embeddings (wpe): Learnable positional embeddings for sequences up to block_size.
  2. Transformer Blocks:

    • A stack of 4 transformer blocks, each comprising:
      • Multi-head self-attention mechanisms.
      • Feedforward networks for feature transformation.
  3. Output Head:

    • Linear Layer (lm_head): Maps hidden states to logits for token predictions.
    • Implements weight sharing between token embeddings (wte) and output projection for parameter efficiency.
  4. Layer Normalization:

    • Final layer normalization (ln_f) ensures stable optimization.
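
A minimal PyTorch sketch tying these components together is shown below. The class and attribute names (Block, Leap0, config fields) follow common GPT conventions and are assumptions; the Block here is a standard pre-norm self-attention + feed-forward block, not necessarily identical to the repository's implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block: causal self-attention + feed-forward."""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = nn.MultiheadAttention(
            config.n_embd, config.n_head, dropout=config.dropout, batch_first=True
        )
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x


class Leap0(nn.Module):
    """Minimal sketch of the components listed above; not the repository's exact code."""

    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # position embeddings
        self.drop = nn.Dropout(config.dropout)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)                    # final layer norm
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight sharing with wte

    def forward(self, idx):
        # idx: (B, T) token ids with T <= block_size
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.wte(idx) + self.wpe(pos))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                          # logits: (B, T, vocab_size)
```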

Current Status

  1. Dataset Used: FineWeb-Edu (18.5 GB), used in its entirety.
  2. Training Steps: 5,000
  3. Time Taken: ~7 hours
  4. File Format: .pt (a loading sketch follows this list)
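
Since the checkpoint is distributed as a .pt file, loading it for inference could look like the sketch below. The checkpoint file name is hypothetical, and the code assumes a plain state_dict was saved; adapt it if the whole model object or a training-state dict was stored instead.

```python
import torch

# Hypothetical checkpoint name; the actual file in the repository may differ.
state = torch.load("leap0_step5000.pt", map_location="cpu")

config = Leap0Config()   # configuration sketch from above
model = Leap0(config)    # model sketch from above
model.load_state_dict(state)
model.eval()
```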

Requirements

  • Python 3.8+
  • PyTorch 2.0+ or TensorFlow 2.10+
  • CUDA-enabled GPU with at least 4GB VRAM (recommended)
  • Dependencies listed in requirements.txt
  • Note: Different operating systems support different CUDA-enabled builds of PyTorch/TensorFlow; verify compatibility with your OS before installing (a quick check is sketched below).
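
A quick way to confirm that the installed PyTorch build can actually use the local GPU:

```python
import torch

print(torch.__version__)             # installed PyTorch version
print(torch.cuda.is_available())     # True only if a CUDA-enabled build sees a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```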

Model Size

  • 95.9M parameters (F32 tensors, Safetensors format)