# NGen3: Next-Generation Foundational Model

NGen3 is a production-grade foundational language model inspired by state-of-the-art architectures such as GPT-4, Claude 3, and Llama 2. It is designed to be highly modular, efficient, and accessible via a flexible command-line interface (CLI). NGen3 supports multiple model variants, ranging from 7M to 1B parameters, and offers a comprehensive suite of tools for:

- **Tokenization:** Process text from local files, URLs, or Hugging Face datasets.
- **Training:** Train the model on tokenized data.
- **Sampling:** Generate text from trained models.
- **Exporting:** Save models and minimal tokenizer configurations in formats compatible with Hugging Face.
- **Knowledge Distillation:** Train a smaller student model using a larger teacher model.
- **Fine-Tuning:** Adapt a distilled model on conversational data (from local sources or directly from Hugging Face).

This repository provides a complete implementation of the NGen3 model along with detailed CLI commands to facilitate experimentation and research.

---

## Table of Contents

- [Model Overview](#model-overview)
- [Architecture](#architecture)
- [Installation](#installation)
- [Usage](#usage)
  - [Tokenization](#tokenization)
  - [Training](#training)
  - [Sampling](#sampling)
  - [Exporting](#exporting)
  - [Knowledge Distillation](#knowledge-distillation)
  - [Fine-Tuning](#fine-tuning)
    - [Local Fine-Tuning](#local-fine-tuning)
    - [Hugging Face Fine-Tuning](#hugging-face-fine-tuning)
- [Hyperparameters](#hyperparameters)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgements](#acknowledgements)

---

## Model Overview

NGen3 is designed for rapid development and deployment of foundational language models. Its flexible CLI allows users to:

- **Tokenize Text:** Convert raw text or datasets into a tokenized binary format.
- **Train Models:** Choose among hyperparameter configurations matched to the desired model size.
- **Generate Samples:** Evaluate model performance and generate text samples.
- **Export Models:** Export models as `safetensors` weights with JSON configurations for integration with Hugging Face tools.
- **Distill Models:** Leverage knowledge distillation to compress larger models into efficient student variants.
- **Fine-Tune on Conversations:** Adapt models to conversational data from both local and Hugging Face datasets.

Illustrative sketches of the tokenization format, a transformer block, the export step, and the distillation loss follow the installation instructions below.

---

## Architecture

NGen3’s architecture is built on the transformer decoder design. Key components include:

- **Token and Positional Embeddings:** Learnable embeddings that encode input tokens and their positions.
- **Stack of Transformer Blocks:** Each block contains:
  - **Causal Self-Attention:** Multi-head attention with a causal mask that prevents information leaking from future tokens.
  - **MLP (Feed-Forward Network):** Uses GELU activation for non-linearity.
  - **Residual Connections and Layer Normalization:** Stabilize training and improve convergence.
- **Final Projection Layer:** Maps embeddings to logits over the vocabulary.

The model supports variants with parameter counts ranging from 7M to 1B, making it adaptable to a range of research and production needs.

---

## Installation

Ensure you have Python 3.8+ installed along with the following packages:

- PyTorch
- transformers
- datasets
- tqdm
- safetensors (for export functionality)

Install the required packages using pip:

```bash
pip install torch transformers datasets tqdm safetensors
```
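**Tokenization format.** As a rough illustration of the tokenized binary format the CLI produces, the sketch below encodes a local text file with a GPT-2 BPE tokenizer and writes the token ids as a flat `uint16` array. Everything here is an assumption for illustration: the tokenizer choice, the file names, the `numpy` dependency, and the on-disk layout may all differ from NGen3's actual implementation.

```python
import numpy as np  # assumed available; not part of the package list above
from transformers import AutoTokenizer

# Hypothetical sketch: NGen3's actual tokenizer and file layout may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode a local text file into a flat list of token ids.
with open("input.txt", "r", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read())

# Write the ids as a raw uint16 array; uint16 covers GPT-2's 50,257-token vocabulary.
np.array(ids, dtype=np.uint16).tofile("train.bin")
```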
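**Transformer block.** The decoder block described in the Architecture section can be summarized in compact PyTorch. This is a minimal pre-norm sketch with assumed hyperparameter names (`n_embd`, `n_head`, `block_size`); NGen3's actual blocks may differ in layout and details.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal (lower-triangular) mask."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)
        # Lower-triangular mask blocks attention to future positions.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (batch, heads, time, head_dim) for per-head attention.
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    """Pre-norm transformer block: attention and a GELU MLP, each with a residual."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around MLP
        return x

# Quick shape check: batch of 2 sequences of length 16, embedding width 64.
x = torch.randn(2, 16, 64)
print(Block(n_embd=64, n_head=4, block_size=128)(x).shape)  # torch.Size([2, 16, 64])
```

The lower-triangular mask is what makes the attention causal: position *t* can attend only to positions at or before *t*, so next-token training never sees the future.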
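**Export.** Exporting weights as `safetensors` with a JSON configuration, as described in the Model Overview, typically amounts to a `save_file` call plus a config dump. The model, file names, and config fields below are placeholders, not NGen3's actual export schema.

```python
import json
import torch.nn as nn
from safetensors.torch import save_file

model = nn.Linear(8, 8)  # stand-in for a trained NGen3 model
save_file(model.state_dict(), "model.safetensors")  # weights in safetensors format

# Minimal JSON config alongside the weights; these fields are illustrative.
config = {"vocab_size": 50257, "n_layer": 12, "n_head": 12, "n_embd": 768}
with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```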
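**Distillation loss.** Knowledge distillation conventionally blends a temperature-softened KL term against the teacher's logits with the ordinary hard-label cross-entropy (Hinton et al., 2015). The sketch below follows that standard recipe; the temperature, weighting, and function name are illustrative assumptions, not NGen3's exact objective.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Standard KD objective; T, alpha, and the blend are illustrative choices."""
    # KL divergence between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude matches the hard-label term
    # Ordinary next-token cross-entropy against the ground-truth token ids.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```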