# NGen3: Next-Generation Foundational Model

NGen3 is a production-grade foundational language model inspired by state-of-the-art architectures such as GPT-4, Claude 3, and Llama 2. It is designed to be highly modular, efficient, and accessible via a flexible command-line interface (CLI). NGen3 supports multiple model variants, ranging from 7M to 1B parameters, and offers a comprehensive suite of tools for:

- **Tokenization:** Process text from local files, URLs, or Hugging Face datasets.
- **Training:** Train the model on tokenized data.
- **Sampling:** Generate text from trained models.
- **Exporting:** Save models and minimal tokenizer configurations in formats compatible with Hugging Face.
- **Knowledge Distillation:** Train a smaller student model using a larger teacher model.
- **Fine-Tuning:** Adapt a distilled model on conversational data (from local sources or directly from Hugging Face).

This repository provides a complete implementation of the NGen3 model along with detailed CLI commands to facilitate experimentation and research.

---

## Table of Contents

- [Model Overview](#model-overview)
- [Architecture](#architecture)
- [Installation](#installation)
- [Usage](#usage)
  - [Tokenization](#tokenization)
  - [Training](#training)
  - [Sampling](#sampling)
  - [Exporting](#exporting)
  - [Knowledge Distillation](#knowledge-distillation)
  - [Fine-Tuning](#fine-tuning)
    - [Local Fine-Tuning](#local-fine-tuning)
    - [Hugging Face Fine-Tuning](#hugging-face-fine-tuning)
- [Hyperparameters](#hyperparameters)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgements](#acknowledgements)

---

## Model Overview

NGen3 is designed for rapid development and deployment of foundational language models. Its flexible CLI allows users to:

- **Tokenize Text:** Convert raw text or datasets into a tokenized binary format.
- **Train Models:** Choose among hyperparameter configurations matched to the desired model size.
- **Generate Samples:** Evaluate model performance and generate text samples.
- **Export Models:** Export models as `safetensors` weights with JSON configurations for integration with Hugging Face tools.
- **Distill Models:** Leverage knowledge distillation to compress larger models into efficient student variants.
- **Fine-Tune on Conversations:** Adapt models to conversational data from both local and Hugging Face datasets.

Illustrative sketches of the tokenization format, a transformer block, the export step, and the distillation loss follow the installation instructions below.

---

## Architecture

NGen3’s architecture is built on the transformer decoder design. Key components include:

- **Token and Positional Embeddings:** Learnable embeddings that encode input tokens and their positions.
- **Stack of Transformer Blocks:** Each block contains:
  - **Causal Self-Attention:** Multi-head attention with a causal mask that prevents information leaking from future tokens.
  - **MLP (Feed-Forward Network):** Uses GELU activation for non-linearity.
  - **Residual Connections and Layer Normalization:** Stabilize training and improve convergence.
- **Final Projection Layer:** Maps embeddings to logits over the vocabulary.

The model supports variants with parameter counts ranging from 7M to 1B, making it adaptable to a range of research and production needs.

---

## Installation

Ensure you have Python 3.8+ installed along with the following packages:

- PyTorch
- transformers
- datasets
- tqdm
- safetensors (for export functionality)

Install the required packages using pip:

```bash
pip install torch transformers datasets tqdm safetensors
```
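**Tokenization format.** As a rough illustration of the tokenized binary format the CLI produces, the sketch below encodes a local text file with a GPT-2 BPE tokenizer and writes the token ids as a flat `uint16` array. Everything here is an assumption for illustration: the tokenizer choice, the file names, the `numpy` dependency, and the on-disk layout may all differ from NGen3's actual implementation.

```python
import numpy as np  # assumed available; not part of the package list above
from transformers import AutoTokenizer

# Hypothetical sketch: NGen3's actual tokenizer and file layout may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode a local text file into a flat list of token ids.
with open("input.txt", "r", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read())

# Write the ids as a raw uint16 array; uint16 covers GPT-2's 50,257-token vocabulary.
np.array(ids, dtype=np.uint16).tofile("train.bin")
```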
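**Transformer block.** The decoder block described in the Architecture section can be summarized in compact PyTorch. This is a minimal pre-norm sketch with assumed hyperparameter names (`n_embd`, `n_head`, `block_size`); NGen3's actual blocks may differ in layout and details.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal (lower-triangular) mask."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)
        # Lower-triangular mask blocks attention to future positions.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (batch, heads, time, head_dim) for per-head attention.
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    """Pre-norm transformer block: attention and a GELU MLP, each with a residual."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around MLP
        return x

# Quick shape check: batch of 2 sequences of length 16, embedding width 64.
x = torch.randn(2, 16, 64)
print(Block(n_embd=64, n_head=4, block_size=128)(x).shape)  # torch.Size([2, 16, 64])
```

The lower-triangular mask is what makes the attention causal: position *t* can attend only to positions at or before *t*, so next-token training never sees the future.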
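**Export.** Exporting weights as `safetensors` with a JSON configuration, as described in the Model Overview, typically amounts to a `save_file` call plus a config dump. The model, file names, and config fields below are placeholders, not NGen3's actual export schema.

```python
import json
import torch.nn as nn
from safetensors.torch import save_file

model = nn.Linear(8, 8)  # stand-in for a trained NGen3 model
save_file(model.state_dict(), "model.safetensors")  # weights in safetensors format

# Minimal JSON config alongside the weights; these fields are illustrative.
config = {"vocab_size": 50257, "n_layer": 12, "n_head": 12, "n_embd": 768}
with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```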
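**Distillation loss.** Knowledge distillation conventionally blends a temperature-softened KL term against the teacher's logits with the ordinary hard-label cross-entropy (Hinton et al., 2015). The sketch below follows that standard recipe; the temperature, weighting, and function name are illustrative assumptions, not NGen3's exact objective.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Standard KD objective; T, alpha, and the blend are illustrative choices."""
    # KL divergence between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude matches the hard-label term
    # Ordinary next-token cross-entropy against the ground-truth token ids.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```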