NoPE GPT
NoPE GPT is a generative pretrained Transformer (GPT) style language model with no positional embeddings (NoPE). Built with PyTorch and trained on HuggingFace's FineWeb, SmolTalk, and UltraFeedback datasets, NoPE GPT can answer questions, summarize documents, use tools, and more.
Features
- No positional embeddings (NoPE): NoPE GPT removes positional embeddings from the architecture entirely, yielding a more parsimonious model whose context length can vary without complex model surgery. Despite having no positional embeddings, NoPE GPT generalizes to longer contexts better than the best relative embeddings (ALiBi, RoPE, T5), offering good performance even when operating within 2X of the trained context window (see the sketch after this list).
- Fast and memory-efficient: NoPE GPT employs a number of training and inference-time optimizations such as grouped-query attention (GQA), KV caching, quantization-aware fine-tuning (QAT), activation checkpointing, and fully-sharded data parallel (FSDP) pretraining. As such, you can train and run inference on relatively modest hardware.
- Fully open-source: Unlike closed-source LLMs, NoPE GPT provides both the model weights and the source code to train, fine-tune, export, and generate text from the model on your own hardware.
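To make the NoPE idea concrete, here is a minimal sketch of a causal self-attention layer that adds no positional information to the token embeddings and relies solely on the causal mask for order. This is a schematic illustration under standard assumptions, not the repository's actual implementation, and the class name and layer sizes are arbitrary.

import torch
import torch.nn.functional as F
from torch import nn


class NoPESelfAttention(nn.Module):
    """Causal self-attention with no positional embeddings (illustrative only)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()

        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.out = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape

        # Note: no positional embeddings are added to x anywhere.
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape to (batch, heads, tokens, head_dim).
        q, k, v = (
            z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2) for z in (q, k, v)
        )

        # The causal mask is the only source of order information.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        return self.out(y.transpose(1, 2).reshape(b, t, d))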
Pretrained Models
Name | Context Length | Vocab. Size | Embedding Dim. | Query Heads | Key/Value Heads | Layers |
---|---|---|---|---|---|---|
NoPE-GPT-400M-Chat | 8192 | 50,261 | 1280 | 20 | 5 | 20 |
NoPE-GPT-400M-Base | 8192 | 50,257 | 1280 | 20 | 5 | 20 |
Installation
The code required to run inference comes as a Python package that you can install with your favorite package manager such as pip.
pip install nope-gpt
Pretrained Examples
In this first example, we'll show how to load a pretrained base model from HuggingFace Hub and use it to generate text. First, make sure the nope-gpt package is installed in your project. Then, load the pretrained weights from HuggingFace Hub as in the example below.
from nope_gpt.model import NoPEGPT
from nope_gpt.tokenization import BaseTokenizer
model_name = "andrewdalpino/NoPE-GPT-400M-Base"
model = NoPEGPT.from_pretrained(model_name)
tokenizer = BaseTokenizer.from_pretrained(model_name)
Then, to generate text, provide a prompt, tokenize it, and iterate the generate() method until the model outputs a stop token.
import torch

prompt = input("Enter a prompt: ")

# Tokenize the prompt and convert it to a tensor of token IDs.
prompt = tokenizer.tokenize(prompt)
prompt = torch.tensor(prompt, dtype=torch.int64)

# Stream tokens from the model until it emits a stop token.
for token, probability in model.generate(prompt):
    if token.item() in tokenizer.stop_tokens:
        break

    out = tokenizer.decode_single_token(token)

    print(out, end="", flush=True)
Generating text from the base model is the simplest way to get started with inference, but it is not the most useful for chatting with the model or guiding its output. In this example, we'll load one of the pretrained chat models from HuggingFace Hub and chat with it. In addition, we'll use short-term memory so the model can remember the chat history.
First, load a pretrained chat model from HuggingFace Hub like in the example below.
from nope_gpt.model import NoPEGPT
from nope_gpt.tokenization import ChatTokenizer
model_name = "andrewdalpino/NoPE-GPT-400M-Chat"
model = NoPEGPT.from_pretrained(model_name)
tokenizer = ChatTokenizer.from_pretrained(model_name)
Then, we'll define a partial function that generates tokens with a set of default parameters such as max_tokens, context_length, and temperature. A sketch of how these sampling parameters typically behave follows the code below.
from functools import partial

generate = partial(
    model.generate,
    max_tokens=2000,
    context_length=8192,
    temperature=0.7,
    top_k=500,
    top_p=0.9,
    repeat_penalty=0.1,
    repeat_window=50,
)
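For intuition, here is a rough sketch of how parameters like temperature, top_k, and top_p are typically applied to the next-token logits. It illustrates the general technique only; the exact behavior of NoPE GPT's generate() method (including repeat_penalty and repeat_window) may differ, and sample_next_token is a hypothetical helper.

import torch


def sample_next_token(logits: torch.Tensor, temperature=0.7, top_k=500, top_p=0.9) -> int:
    """Typical temperature / top-k / top-p (nucleus) sampling over a 1D logits vector."""

    # Temperature scaling: lower values sharpen the distribution.
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens.
    values, indices = torch.topk(logits, k=min(top_k, logits.numel()))

    probabilities = torch.softmax(values, dim=-1)

    # Top-p (nucleus): keep the smallest prefix of tokens whose cumulative probability reaches p.
    cumulative = torch.cumsum(probabilities, dim=-1)

    keep = cumulative - probabilities < top_p  # always keeps at least the first token

    probabilities = probabilities * keep
    probabilities /= probabilities.sum()

    # Sample from the filtered distribution and map back to the vocabulary index.
    choice = torch.multinomial(probabilities, num_samples=1)

    return indices[choice].item()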
Next, we'll instantiate a BufferWindowMemory object to handle the chat history and craft a system message to guide generation. Note that messages are passed as dicts with role and content keys. For a system message, use the system role.
from nope_gpt.memory import BufferWindowMemory

memory = BufferWindowMemory(4)

system_message = {
    "role": "system",
    "content": "You are a friendly AI assistant.",
}
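As a mental model, a buffer-window memory retains only the most recent messages and drops older ones. The toy class below illustrates that behavior; it is not the nope_gpt implementation, and the actual BufferWindowMemory may differ in details such as whether the window counts individual messages or whole exchanges.

from collections import deque


class ToyBufferWindowMemory:
    """Illustrative sliding-window chat memory (not the nope_gpt class)."""

    def __init__(self, window_size: int):
        # Older messages fall off the left end once the window is full.
        self.buffer = deque(maxlen=window_size)

    def add_message(self, message: dict) -> None:
        self.buffer.append(message)

    def get_history(self) -> list[dict]:
        return list(self.buffer)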
Finally, we'll prompt the user for input, add the system message and chat history to the context, tokenize the messages, and then generate the assistant response.
import torch

while True:
    prompt = input("Enter a prompt: ")

    user_message = {
        "role": "user",
        "content": prompt,
    }

    memory.add_message(user_message)

    # Prepend the system message to the rolling chat history.
    messages = [system_message] + memory.get_history()

    tokens = tokenizer.tokenize_prompt(messages)

    prompt = torch.tensor(tokens, dtype=torch.int64)

    response = ""

    for token, probability in generate(prompt):
        if token.item() in tokenizer.stop_tokens:
            break

        out = tokenizer.decode_single_token(token)

        print(out, end="", flush=True)

        response += out

    print("\n")

    assistant_message = {
        "role": "assistant",
        "content": response,
    }

    memory.add_message(assistant_message)
You're done! For more advanced usage, take a look at the generate.py and chat.py scripts located in the code repository.
Training and Fine-tuning
In addition to the inference code, we also provide training and fine-tuning code so you can build your own NoPE GPT models. Before getting started, take a look at the model_sizing.ipynb notebook in the project repo for a guide to sizing your model based on the amount of memory and compute you have available. A rough sketch of the underlying parameter arithmetic is shown below.
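As an illustration of the sizing math (not the notebook's exact method), the snippet below estimates a model's parameter count from the architecture arguments, assuming tied input/output embeddings, grouped-query attention, and a plain two-layer MLP. The real architecture may differ in details such as gating, biases, or normalization parameters.

def approximate_parameters(
    vocab_size: int,
    embedding_dimensions: int,
    num_q_heads: int,
    num_kv_heads: int,
    num_hidden_layers: int,
    feed_forward_ratio: int,
) -> int:
    d = embedding_dimensions

    # Token embeddings (assumed tied with the output head).
    embedding = vocab_size * d

    # Attention: full-size query and output projections, shrunken key/value projections (GQA).
    kv_dim = d * num_kv_heads // num_q_heads
    attention = d * d + 2 * d * kv_dim + d * d

    # MLP: assumed to be a plain two-layer feed-forward block.
    mlp = 2 * d * (feed_forward_ratio * d)

    return embedding + num_hidden_layers * (attention + mlp)


# Roughly matches the 400M model in the table above (~408M under these assumptions).
print(approximate_parameters(50257, 1280, 20, 5, 20, 4))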
Clone the Project Repo
We'll need the code from the project repository to train and/or fine-tune the model.
git clone https://github.com/andrewdalpino/NoPE-GPT
Install Project Dependencies
Project dependencies are specified in the requirements.txt file. You can install them with pip using the following command from the project root. We recommend using a virtual environment such as venv to keep package dependencies on your system tidy.
python -m venv ./.venv
source ./.venv/bin/activate
pip install -r requirements.txt
Pretraining
Pretraining builds a foundation of language and general knowledge to use as a base for later supervised fine-tuning. The training objective is to predict the next token in a sample of text, as sketched below. It is a self-supervised form of training because the model learns from unlabeled data, with the targets derived from the text itself. For the pretraining corpus we use the FineWeb dataset, which consists of 15T high-quality tokens gathered from the web. In addition, the dataset has been split into three subsets (10BT, 100BT, and 350BT token versions) for training smaller models.
python pretrain.py
Note that it will take a while to download and pre-process the dataset the first time that the training script is run.
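To make the objective concrete, next-token prediction shifts the token sequence by one position so that each token's target is the token that follows it, and the loss is the cross-entropy between the model's predictions and those targets. The snippet below is a generic illustration of that objective with stand-in tensors, not the actual pretrain.py training loop.

import torch
import torch.nn.functional as F

# A batch of packed token IDs of shape (batch, tokens_per_sample + 1).
tokens = torch.randint(0, 50257, (2, 4097))

inputs = tokens[:, :-1]   # what the model sees
targets = tokens[:, 1:]   # the next token at every position

# logits = model(inputs) would have shape (batch, sequence, vocab_size).
logits = torch.randn(2, 4096, 50257)  # stand-in for the model's output

loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())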
To customize the default architecture, you can adjust the embedding_dimensions, num_q_heads, num_kv_heads, num_hidden_layers, and feed_forward_ratio arguments of the pretraining script.
python pretrain.py --embedding_dimensions=4096 --num_q_heads=64 --num_kv_heads=16 --num_hidden_layers=48 --feed_forward_ratio=4
You can also adjust the batch_size, learning_rate, and gradient_accumulation_steps to suit your training setup.
python pretrain.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=128
If you are planning a long training run, it is recommended to set a random seed. This will ensure that any random state is preserved if the process gets interrupted.
python pretrain.py --seed=42
For distributed training, use PyTorch's torchrun utility to launch a distributed data parallel (DDP) session. The example below executes the training script on a single node with 8 GPUs.
torchrun --standalone --nnodes=1 --nproc-per-node=8 pretrain.py --batch_size=16 --gradient_accumulation_steps=128
Note that when training in data-parallel mode, it's important that gradient_accumulation_steps is evenly divisible by the world size for maximum performance. For example, with an 8 GPU cluster we could complete 32 gradient accumulation steps in exactly 4 passes over the network on each GPU.
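A quick back-of-the-envelope check of that example, following the accounting in the paragraph above where the accumulation steps are spread across the cluster:

world_size = 8                    # GPUs in the cluster (from the example above)
gradient_accumulation_steps = 32  # micro-batches accumulated before each weight update

# The accumulation steps divide evenly across the GPUs...
assert gradient_accumulation_steps % world_size == 0

passes_per_gpu = gradient_accumulation_steps // world_size  # ...so each GPU makes exactly 4 passes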
Pretraining Arguments
Argument | Default | Type | Description |
---|---|---|---|
--dataset_subset | "sample-10BT" | str | The subset of the FineWeb dataset to train on. Options are sample-10BT, sample-100BT, and sample-350BT. Set to None to train on the full 15T token dataset. |
--token_encoding | "r50k_base" | str | The Tiktoken encoding scheme to use when tokenizing the dataset. Options include r50k_base, p50k_base, cl100k_base, and o200k_base. |
--dataset_path | "./datasets" | str | The path to the preprocessed dataset files on disk. |
--batch_size | 2 | int | The number of samples of size tokens_per_sample to pass through the network at a time. |
--gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the model weights. |
--tokens_per_sample | 4096 | int | The number of tokens to pack into a single training sequence. This is sometimes called the block size or context length. |
--max_steps | 10000 | int | The maximum number of steps to take for pretraining. |
--learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
--low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
--max_gradient_norm | 10.0 | float | Clip gradients above this threshold norm before stepping. |
--embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
--num_q_heads | 16 | int | The number of query heads within every attention layer. |
--num_kv_heads | 4 | int | The number of key and value heads within every attention layer. |
--num_hidden_layers | 16 | int | The number of attention/MLP blocks within the body of the network. |
--feed_forward_ratio | 4 | (1, 2, 4) | The ratio of hidden neurons to embedding dimensions in the MLP layers of the network. |
--dropout | 0.0 | float | The proportion of signals to send to zero during training as regularization. |
--activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of recomputing the forward pass. |
--ddp_sharding_level | 2 | int | The level of sharding to use for DDP training. Options are 2 or 3 for partial and full sharding respectively, or 0 for no sharding. |
--eval_interval | 100 | int | Evaluate the model after this many epochs on the testing set. |
--num_eval_samples | 2048 | int | The number of hold-out samples to use for validation during training. |
--checkpoint_interval | 100 | int | Save the model checkpoint to disk every this many epochs. |
--checkpoint_path | "./checkpoints/checkpoint.pt" | str | The path to the base checkpoint file on disk. |
--resume | False | bool | Should we resume training from the last checkpoint? |
--run_dir_path | "./runs" | str | The path to the TensorBoard run directory for this training session. |
--device | "cpu" | str | The device to run the training on, e.g. cuda, cuda:0, mps, cpu. |
--seed | None | int | The seed for the random number generator. |
Fine-tuning
Instruction-tuning is a supervised training technique focused on developing specialized objectives such as chatting, text summarization, chain-of-thought, and prompt rewriting. We use the SmolTalk and UltraFeedback datasets by HuggingFace as fine-tuning corpora because they include a broad range of training objectives such as conversation, instruction following, summarization, and human preference alignment.
python fine-tune.py
To pick which dataset subsets to train on you can specify them in a comma-separated list like in the example below.
python fine-tune.py --dataset_subsets=smol-magpie-ultra,smol-summarize,ultra-feedback
You can also adjust the batch_size, learning_rate, and gradient_accumulation_steps just like we did with pretraining.
python fine-tune.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=32
To adjust the number of trainable LoRA parameters and the strength of the LoRA update, change the --rank and --alpha arguments respectively.
python fine-tune.py --rank=4 --alpha=2.0
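For reference, the standard LoRA formulation (which the fine-tuning script presumably follows in spirit) freezes a pretrained weight matrix W and learns a low-rank update scaled by alpha: h = W x + (alpha / rank) * B A x. The sketch below is a generic illustration of that idea, not the repository's LoRA implementation.

import torch
from torch import nn


class ToyLoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank (LoRA) update. Illustrative only."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 1.0):
        super().__init__()

        self.base = base
        self.base.requires_grad_(False)  # the pretrained weights stay frozen

        in_features, out_features = base.in_features, base.out_features

        # Only these two small matrices are trained: rank * (in + out) parameters.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)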
To quantize the base model weights during fine-tuning (QLoRA), specify the quantize_base_weights argument, and to adjust the quantization group size, set the quant_group_size argument as in the example below.
python fine-tune.py --quantize_base_weights --quant_group_size=128
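To illustrate what the group size controls: group-wise quantization splits the weights into groups of quant_group_size values, and each group gets its own scale, so smaller groups track the weight distribution more closely at the cost of storing more scales. The sketch below shows generic absmax int8 group quantization; it is not necessarily the exact scheme the script uses.

import torch


def quantize_groupwise(weights: torch.Tensor, group_size: int = 128):
    """Absmax int8 quantization with one scale per group of group_size weights."""

    # Assumes the number of weights is a multiple of the group size.
    groups = weights.reshape(-1, group_size)

    # One scale per group, derived from that group's largest magnitude.
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 127.0).clamp_min(1e-8)

    quantized = torch.clamp(torch.round(groups / scales), -127, 127).to(torch.int8)

    return quantized, scales


def dequantize_groupwise(quantized: torch.Tensor, scales: torch.Tensor, shape):
    return (quantized.float() * scales).reshape(shape)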
In memory-constrained environments, you can enable activation checkpointing to trade compute for memory efficiency by recomputing the activations of each decoder block during backpropagation.
python fine-tune.py --activation_checkpointing
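Under the hood, activation checkpointing typically means wrapping each decoder block in torch.utils.checkpoint so that intermediate activations are discarded during the forward pass and recomputed when gradients flow back. The sketch below shows the general pattern only; the blocks argument is a hypothetical stand-in and does not reflect the actual model internals.

import torch
from torch.utils.checkpoint import checkpoint


def forward_with_checkpointing(blocks, x: torch.Tensor) -> torch.Tensor:
    """Run a stack of decoder blocks, recomputing activations on the backward pass."""

    for block in blocks:
        # Activations inside the block are not stored; they are recomputed during backprop.
        x = checkpoint(block, x, use_reentrant=False)

    return x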
Fine-tuning Arguments
Argument | Default | Type | Description |
---|---|---|---|
--base_checkpoint_path | None | str | The path to the base model checkpoint on disk. |
--dataset_subset | "all" | str | A comma-separated list of subsets of the dataset to train on. Options are all, apigen-80k, everyday-conversations, explore-instruct-rewriting, longalign, metamathqa-50k, numina-cot-100k, openhermes-100k, self-oss-instruct, smol-constraints, smol-magpie-ultra, smol-rewrite, smol-summarize, systemchats-30k, and ultra-feedback. |
--max_tokens_per_sample | 4096 | int | The maximum number of tokens to pack into a single training sequence. |
--filter_long_samples | False | bool | Should we filter out samples that are longer than the max_tokens_per_sample? |
--num_dataset_processes | 8 | int | The number of processes to use for processing the dataset. |
--batch_size | 1 | int | The number of samples to pass through the network at a time. |
--gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
--num_epochs | 2 | int | The number of epochs to train for. |
--learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
--low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
--max_gradient_norm | 1.0 | float | Clip gradients above this threshold norm before stepping. |
--rank | 8 | int | The rank of the LoRA decomposition matrices. |
--alpha | 1.0 | float | The strength of the LoRA signal. |
--freeze_token_embeddings | False | bool | Should we freeze the weights of the token embeddings? |
--activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of recomputing the forward pass. |
--eval_interval | 1 | int | Evaluate the model after this many epochs on the testing set. |
--num_eval_samples | 2048 | int | The number of hold-out samples to use for validation during training. |
--checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
--checkpoint_path | "./checkpoints/checkpoint.pt" | str | The path to the model checkpoint. |
--resume | False | bool | Should we resume training from the last checkpoint? |
--run_dir_path | "./runs" | str | The path to the TensorBoard run directory for this training session. |
--device | "cpu" | str | The device to run the training on, e.g. cuda, cuda:0, mps, cpu. |
--seed | None | int | The seed for the random number generator. |
Training Dashboard
We use TensorBoard to capture and display training events such as loss and gradient norm updates. To launch the dashboard server, run the following command from the terminal.
tensorboard --logdir=./runs
Then navigate to the dashboard using your favorite web browser.
References:
- G. Penedo, et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
- L. B. Allal, et al. SmolLM2 - with great data, comes great performance, 2024.
- A. Radford, et al. Language Models are Unsupervised Multitask Learners, OpenAI, 2019.
- T. Brown, et al. Language Models are Few-Shot Learners. OpenAI, 2020.
- A. Kazemnejad, et al. The Impact of Positional Encoding on Length Generalization in Transformers, 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
- S. Rajbhandari, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020.
- J. R. Hermans, et al. Accumulated Gradient Normalization, JMLR: Workshop and Conference Proceedings, 2017.
- T. Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016.
- B. Zhang, et al. Root Mean Square Layer Normalization, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019).
- J. Kaplan, et al. Scaling Laws for Neural Language Models, OpenAI, 2020.
- J. Hoffmann, et al. Training Compute-Optimal Large Language Models, DeepMind, 2022.
- J. Ainslie, et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, Google Research, 2023.