andrewdalpino committed on
Commit d866d0b · verified · 1 Parent(s): 60cfb30

Update README.md

  1. README.md +76 -35
README.md CHANGED
@@ -3,27 +3,29 @@ library_name: transformers
3
  datasets:
4
  - HuggingFaceFW/fineweb
5
  - HuggingFaceTB/smoltalk
 
6
  language:
7
  - en
8
  metrics:
9
  - perplexity
10
  pipeline_tag: text-generation
11
  ---
12
- # LightGPT
13
 
14
- LightGPT is a lightweight generative pretrained Transformer (GPT) language model for the people! Built using [PyTorch](https://pytorch.org/) and trained on HuggingFace's [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) datasets, LightGPT can answer questions, follow instructions, summarize documents, chat, and more. Best of all, the model weights *and* code are fully open-source for you to customize, improve upon, and share with the world.
 
 
15
 
16
  ## Features
17
 
18
- - **No positional embeddings (NoPE)**: LightGPT aims to be a more parsimonious model by completely removing positional embeddings from the architecture. This allows for a variable context length without complex model surgery. Despite having no positional embeddings, LightGPT performs better at context length generalization than the best relative embeddings (ALiBi, RoPE, T5) offering good performance even when operating within 2X the trained context window.
19
 
20
- - **Low Memory Utilization**: LightGPT lets you progressively employ training-time memory optimizations such as fully-sharded data-parallel (FSDP), activation checkpointing, mixed precision, and low-memory optimizer updates that allow you to train larger models on smaller hardware.
21
 
22
- - **Fully Open-source**: Unlike closed-source LLMs, LightGPT provides both the model weights *and* the source code to train, fine-tune, export, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize access to AI and continually improve the models.
23
 
24
- ## Suggested Pretraining Configurations
25
 
26
- Below is a table of some suggested model pretraining configurations but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
27
 
28
  | Name | Vocab. Size | Embedding Dim. | Attn. Heads | Layers | Parameters | Min. Training Tokens |
29
  |---|---|---|---|---|---|---|
@@ -34,9 +36,9 @@ Below is a table of some suggested model pretraining configurations but feel fre
34
  | XX-large | 200,017 | 8192 | 128 | 64 | 53B | 1T |
35
  | XXX-large | 200,017 | 8192 | 128 | 128 | 105B | 2T |
36
 
37
- We typically recommend a training `block size` (also referred to as context length) of between 1024 to 4096 for standard models and 4096 or higher for long-context applications such as conversational chatbots, retrieval augmented generation (RAG), and chain-of-thought (CoT) prompting.
38
 
39
- **Note**: LightGPT can be trained using variable block sizes since the architecture does not depend on any discrete positional embeddings. This flexibility allows you to progressively extend the context window during training.
40
 
41
  ## Install Project Dependencies
42
 
@@ -52,7 +54,7 @@ pip install -r requirements.txt
52
 
53
  ## Pretraining
54
 
55
- For the pretraining corpus we use the Fineweb dataset which consists of about 15T high-quality tokens gathered from the worldwide web. The dataset has been split into 3 subsets (10BT, 100BT, and 350BT versions) for training smaller models. If you'd like to start training right away, the default settings should work on most single-GPU systems with 12G of VRAM or more.
56
 
57
  ```
58
  python pretrain.py
@@ -87,12 +89,11 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pretrain.py --batch_size=16
87
  | --dataset_subset | "sample-10BT" | str | The subset of the Fineweb dataset to train on. Options are `sample-10BT`, `sample-100BT`, and `sample-350BT`. Set to `None` to train on the full 15T token dataset. |
88
  | --token_encoding | "r50k_base" | str | The Tiktoken encoding scheme to use when tokenizing the dataset. Options include `r50k_base`, `p50k_base`, `cl100k_base`, and `o200k_base`. |
89
  | --dataset_path | "./datasets" | str | The path to the preprocessed dataset files on disk. |
90
- | --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
91
- | --batch_size | 1 | int | The number of samples of size `tokens_per_sample` to pass through the network at a time. |
92
  | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the model weights. |
93
  | --tokens_per_sample | 1024 | int | The number of tokens to pack into a single training sequence. This is sometimes called the block size or context length. |
94
  | --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
95
- | --num_epochs | 1686 | int | The number of epochs to train for. |
96
  | --learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
97
  | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
98
  | --low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
@@ -112,50 +113,77 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pretrain.py --batch_size=16
112
  | --device | "cuda" | str | The device to run the computation on. |
113
  | --seed | None | int | The seed for the random number generator. |
114
 
115
- ### Training Dashboard
116
 
117
- We use [TensorBoard](https://www.tensorflow.org/tensorboard) to capture and display pretraining events such as loss and gradient norm updates. To launch the dashboard server run the following command from the terminal.
 
 
118
 
119
  ```
120
- tensorboard --logdir=./runs
121
  ```
122
 
123
- Then navigate to the dashboard using your favorite web browser.
124
 
125
- ## Instruction-tuning
 
 
 
 
 
 
 
 
 
 
 
 
 
 
126
 
127
  ### Instruction-tuning Arguments
128
 
129
  | Argument | Default | Type | Description |
130
  |---|---|---|---|
131
  | --base_model_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint on disk. |
132
- | --dataset_subset | "all" | str | The subset of the SmolTalk dataset to train on. Options are `all`, `smol-magpie-ultra`, `smol-constraints`, `smol-rewrite`, and `smol-summarize`. |
133
  | --max_tokens_per_sample | 1024 | int | The maximum number of tokens to pack into a single training sequence. |
134
- | --mask_input | False | bool | Should we mask the input part of the training sequences i.e. only train on the supervised output? |
135
- | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
136
  | --gradient_accumulation_steps | 64 | int | The number of batches to pass through the network before updating the weights. |
137
  | --learning_rate | 5e-4 | float | The learning rate of the Adafactor optimizer. |
138
  | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
139
- | --optimizer_low_memory | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
 
140
  | --rank | 8 | int | The rank of the LoRA decomposition matrices. |
141
  | --alpha | 1.0 | float | The strength of the LoRA signal. |
142
  | --dropout | 0.05 | float | The proportion of signals to send to zero during training as regularization. |
143
- | --num_epochs | 4 | int | The number of epochs to train for. |
144
  | --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of needing to recompute the forward pass. |
145
  | --eval_interval | 1 | int | Evaluate the model after this many epochs on the testing set. |
 
146
  | --checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
147
- | --checkpoint_path | "./checkpoints/lora_instruction.pt" | string | The path to the LoRA checkpoint. |
148
  | --resume | False | bool | Should we resume training from the last checkpoint? |
149
  | --run_dir_path | "./runs/instruction-tune" | str | The path to the TensorBoard run directory for this training session. |
150
  | --device | "cuda" | string | The device to run the computation on. |
151
  | --seed | None | int | The seed for the random number generator. |
152
 
 
 
 
 
 
 
 
 
 
 
153
  ## Text Generation
154
 
155
- After training, you can generate text from the model by running the `generate.py` script from the commandline. This inference script samples tokens from the model one at a time conditioned on a prompt and any previously generated tokens, together referred to as the context window. In the example below we are choosing to only sample from the `top_k` predicted tokens that have at least `top_p` cumulative probability mass when ordered descending by predicted probability.
156
 
157
  ```
158
- python generate.py --top_k=500 --top_p=0.9
159
  ```
160
 
161
  ### Generation Arguments
@@ -163,31 +191,45 @@ python generate.py --top_k=500 --top_p=0.9
163
  | Argument | Default | Type | Description |
164
  |---|---|---|---|
165
  | --checkpoint_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint file on disk. |
166
- | --lora_path | None | string | The path to the LoRA checkpoint. |
167
- | --max_tokens | 1000 | int | The maximum number of tokens that the model should generate per sample. |
168
  | --context_length | 1024 | int | The number of tokens to keep within the context window of the current prediction. |
169
  | --temperature | 1.0 | float | The amount of regularization applied to the candidate token probabilities. |
170
  | --top_k | 500 | int | Only sample from this many candidate tokens with the highest probabilities. |
171
  | --top_p | 0.9 | float | Of the `top_k` tokens, drop all but the `top_p` portion of the cumulative probability distribution. |
 
 
172
  | --device | "cuda" | string | The device to run the computation on. |
173
  | --seed | None | int | The seed for the random number generator. |
174
 
175
- We also provide a script that samples entire sequences rather than single tokens independently which we call `beam_search.py`. Beam search maintains a list of the top `beam_width` candidate sequences and outputs the top `num_candidates` completed sequences with the highest overall priority. It is a form of greedy search that works well for some things like text summarization and translation but often results in less natural sounding responses and may even repeat certain sequences.
 
 
176
 
177
  ```
178
- python beam_search.py --beam_width=16 --num_candidates=3
179
  ```
180
 
181
- ### Beam Search Arguments
 
 
 
 
 
 
182
 
183
  | Argument | Default | Type | Description |
184
  |---|---|---|---|
185
  | --checkpoint_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint file on disk. |
186
  | --lora_path | None | string | The path to the LoRA checkpoint. |
187
- | --max_tokens | 100 | int | The maximum number of tokens that the model should generate per sample. |
 
188
  | --context_length | 1024 | int | The number of tokens to keep within the context window of the current prediction. |
189
- | --num_candidates | 3 | int | The number of candidate sequences to output. |
190
- | --beam_width | 16 | int | The number of candidate sequences to keep track of during search. |
 
 
 
191
  | --device | "cuda" | string | The device to run the computation on. |
192
  | --seed | None | int | The seed for the random number generator. |
193
 
@@ -203,4 +245,3 @@ python beam_search.py --beam_width=16 --num_candidates=3
203
  >- B. Zhang, et al. Root Mean Square Layer Normalization. 33rd Conference on Neural Information Processing Systems, NeurIPS 2019.
204
  >- J. Kaplan, et al. Scaling Laws for Neural Language Models, OpenAI, 2020.
205
  >- J. Hoffmann, et al. Training Compute-Optimal Large Language Models, DeepMind, 2022.
206
- >- R. Sutton. The Bitter Lesson, https://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019.
 
3
  datasets:
4
  - HuggingFaceFW/fineweb
5
  - HuggingFaceTB/smoltalk
6
+ - HuggingFaceH4/ultrafeedback_binarized
7
  language:
8
  - en
9
  metrics:
10
  - perplexity
11
  pipeline_tag: text-generation
12
  ---
 
13
 
14
+ # NoPE GPT
15
+
16
+ NoPE GPT is a generative pretrained Transformer (GPT) language model with no positional embeddings (NoPE). Built using [PyTorch](https://pytorch.org/) and trained on HuggingFace's [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk), and [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) datasets, NoPE GPT can answer questions, follow instructions, summarize documents, chat, and more. Best of all, the model weights *and* code are fully open-source for you to customize, improve upon, and share with the world.
17
 
18
  ## Features
19
 
20
+ - **No positional embeddings (NoPE)**: NoPE GPT aims to be a more parsimonious model by completely removing positional embeddings from the architecture. This allows for a variable context length without complex model surgery. Despite having no positional embeddings, NoPE GPT generalizes to longer contexts better than the best relative positional embeddings (ALiBi, RoPE, T5), offering good performance at up to 2X the trained context window.
21
 
22
+ - **Low Memory Utilization**: NoPE GPT lets you progressively employ training-time memory optimizations such as fully-sharded data-parallel (FSDP), activation checkpointing, mixed precision, and low-memory optimizer updates that allow you to train larger models on smaller hardware.
23
 
24
+ - **Fully Open-source**: Unlike closed-source LLMs, NoPE GPT provides both the model weights *and* the source code to train, fine-tune, export, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize access to AI and continually improve the models.
25
 
26
+ ## Recommended Pretraining Configurations
27
 
28
+ The table below lists some recommended model pretraining configurations, but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
29
 
30
  | Name | Vocab. Size | Embedding Dim. | Attn. Heads | Layers | Parameters | Min. Training Tokens |
31
  |---|---|---|---|---|---|---|
 
36
  | XX-large | 200,017 | 8192 | 128 | 64 | 53B | 1T |
37
  | XXX-large | 200,017 | 8192 | 128 | 128 | 105B | 2T |
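
The parameter counts in the table can be roughly cross-checked with a common back-of-the-envelope formula. The sketch below is an assumption, not the method used in `model_sizing.ipynb`: it counts about `12 * d^2` weights per layer (attention projections plus a 4x-wide MLP) plus one embedding table, and applies the rule of thumb of roughly 20 training tokens per parameter often associated with Hoffmann et al. The function names are hypothetical.

```python
def estimate_parameters(vocab_size: int, embedding_dim: int, num_layers: int) -> int:
    """Rough decoder-only Transformer parameter count (ignores biases and norms)."""
    # Assumption: ~4*d^2 weights for the attention projections plus ~8*d^2
    # for a 4x-wide MLP gives ~12*d^2 weights per layer.
    per_layer = 12 * embedding_dim ** 2
    # Assumption: a single token embedding table shared with the output layer.
    embeddings = vocab_size * embedding_dim
    return num_layers * per_layer + embeddings


def min_training_tokens(num_parameters: int, tokens_per_parameter: int = 20) -> int:
    """Rule-of-thumb token budget: about 20 tokens per parameter."""
    return num_parameters * tokens_per_parameter


xx_large = estimate_parameters(200_017, 8192, 64)
xxx_large = estimate_parameters(200_017, 8192, 128)

print(f"XX-large:  {xx_large / 1e9:.1f}B parameters, "
      f"{min_training_tokens(xx_large) / 1e12:.2f}T tokens")
print(f"XXX-large: {xxx_large / 1e9:.1f}B parameters, "
      f"{min_training_tokens(xxx_large) / 1e12:.2f}T tokens")
```

Under these assumptions the estimates land close to the 53B and 105B figures (and the 1T and 2T token budgets) listed for the two largest configurations.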
38
 
39
+ We typically recommend a training `block size` (also referred to as context length) of between 1024 and 4096 for standard models and 4096 or higher for long-context applications such as conversational chatbots, retrieval augmented generation (RAG), and chain-of-thought (CoT) prompting, a.k.a. "reasoning" models.
40
 
41
+ **Note**: NoPE GPT can be trained using variable block sizes since the architecture does not depend on any discrete positional embeddings. This flexibility allows you to progressively extend the context window during training.
42
 
43
  ## Install Project Dependencies
44
 
 
54
 
55
  ## Pretraining
56
 
57
+ When we pre-train NoPE GPT we are focused on building a foundation of language and general knowledge to use as a base for further specialized training. The training objective is to predict the next token in a sample of text. It is a self-supervised form of training because the model learns from masked inputs of unlabeled data. For the pretraining corpus we use the Fineweb dataset, which consists of about 15T high-quality tokens gathered from the World Wide Web. The dataset has been split into 3 subsets (10BT, 100BT, and 350BT versions) for training smaller models. If you'd like to start training right away, the default settings should work on most single-GPU systems with 12 GB of VRAM or more.
58
 
59
  ```
60
  python pretrain.py
 
89
  | --dataset_subset | "sample-10BT" | str | The subset of the Fineweb dataset to train on. Options are `sample-10BT`, `sample-100BT`, and `sample-350BT`. Set to `None` to train on the full 15T token dataset. |
90
  | --token_encoding | "r50k_base" | str | The Tiktoken encoding scheme to use when tokenizing the dataset. Options include `r50k_base`, `p50k_base`, `cl100k_base`, and `o200k_base`. |
91
  | --dataset_path | "./datasets" | str | The path to the preprocessed dataset files on disk. |
92
+ | --batch_size | 2 | int | The number of samples of size `tokens_per_sample` to pass through the network at a time. |
 
93
  | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the model weights. |
94
  | --tokens_per_sample | 1024 | int | The number of tokens to pack into a single training sequence. This is sometimes called the block size or context length. |
95
  | --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
96
+ | --num_epochs | 1690 | int | The number of epochs to train for. |
97
  | --learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
98
  | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
99
  | --low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
 
113
  | --device | "cuda" | str | The device to run the computation on. |
114
  | --seed | None | int | The seed for the random number generator. |
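
To see how the batch-related defaults in the table interact, the arithmetic below works out the token budget they imply. It is a sketch that assumes a single GPU and that one optimizer step consumes `batch_size * gradient_accumulation_steps` samples; how `pretrain.py` divides the work across multiple devices is not shown here.

```python
# Default values taken from the pretraining arguments table.
batch_size = 2
gradient_accumulation_steps = 128
tokens_per_sample = 1024
samples_per_epoch = 4096
num_epochs = 1690

# Samples (and tokens) consumed by one weight update.
samples_per_step = batch_size * gradient_accumulation_steps
tokens_per_step = samples_per_step * tokens_per_sample

# Weight updates per epoch and tokens seen over the whole run.
steps_per_epoch = samples_per_epoch // samples_per_step
total_tokens = samples_per_epoch * tokens_per_sample * num_epochs

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total training tokens:     {total_tokens / 1e9:.2f}B")
```

With these defaults each update sees 262,144 tokens, and the full run covers roughly 7.09B tokens.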
115
 
116
+ ## Instruction-tuning
117
 
118
+ Instruction-tuning is a supervised training technique focused on developing specialized objectives such as chatting, text summarization, chain-of-thought, and prompt rewriting. The overall objective is still to predict the next token but the dataset has been curated for these more specialized objectives. In addition, we introduce three new special tokens (`<|pad|>`, `<|im_start|>` and `<|im_end|>`) that demarcate padding tokens and system, user, and assistant messages for use in the ChatML format. We use the SmolTalk and UltraFeedback datasets by HuggingFace as fine-tuning corpora because they include a broad range of training objectives such as conversation, instruction following, summarization, and human preference alignment.
119
+
120
+ Fine-tuning is less resource-intensive than pre-training because it trains far fewer parameters. The default arguments will work for most GPUs with 12 GB of VRAM or more.
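
For readers unfamiliar with ChatML, here is a minimal sketch of how a conversation is commonly laid out with the `<|im_start|>` and `<|im_end|>` tokens mentioned above. The helper name `to_chatml` and the exact whitespace are assumptions; the template used by `instruction-tune.py` may differ.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} messages in the usual ChatML layout."""
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|im_start|>role\ncontent<|im_end|>\n
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```

The `<|pad|>` token is not shown because it only matters when batching sequences of different lengths.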
121
 
122
  ```
123
+ python instruction-tune.py
124
  ```
125
 
126
+ To pick which dataset subsets to train on you can specify them in a comma-separated list like in the example below.
127
 
128
+ ```
129
+ python instruction-tune.py --dataset_subsets=smol-magpie-ultra,smol-summarize,ultra-feedback
130
+ ```
131
+
132
+ You can also adjust the `batch_size`, `learning_rate`, and `gradient_accumulation_steps` just like we did with pre-training.
133
+
134
+ ```
135
+ python instruction-tune.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=32
136
+ ```
137
+
138
+ To adjust the number of trainable LoRA parameters as well as the strength of the LoRA and Dropout signals you can change the `--rank`, `--alpha`, and `--dropout` arguments respectively.
139
+
140
+ ```
141
+ python instruction-tune.py --rank=4 --alpha=0.8 --dropout=0.1
142
+ ```
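
To get a feel for how `--rank` controls the number of trainable parameters, the sketch below counts the weights in the two low-rank matrices that LoRA adds to a single linear layer. The layer size is a hypothetical example, and which layers the script actually adapts is not stated here.

```python
def lora_trainable_parameters(in_features: int, out_features: int, rank: int) -> int:
    """Weights in the LoRA pair A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


# Hypothetical 1024 x 1024 linear layer.
full = 1024 * 1024
for rank in (4, 8):
    lora = lora_trainable_parameters(1024, 1024, rank)
    print(f"rank={rank}: {lora:,} trainable weights ({100 * lora / full:.2f}% of the full layer)")
```

Halving the rank halves the trainable weights, which is why a small rank keeps fine-tuning cheap.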
143
 
144
  ### Instruction-tuning Arguments
145
 
146
  | Argument | Default | Type | Description |
147
  |---|---|---|---|
148
  | --base_model_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint on disk. |
149
+ | --dataset_subset | "all,ultra-feedback" | str | A comma-separated list of subsets of the dataset to train on. Options are `all`, `apigen-80k`, `everyday-conversations`, `explore-instruct-rewriting`, `longalign`, `metamathqa-50k`, `numina-cot-100k`, `openhermes-100k`, `self-oss-instruct`, `smol-constraints`, `smol-magpie-ultra`, `smol-rewrite`, `smol-summarize`, `systemchats-30k`, and `ultra-feedback`. |
150
  | --max_tokens_per_sample | 1024 | int | The maximum number of tokens to pack into a single training sequence. |
151
+ | --batch_size | 2 | int | The number of samples to pass through the network at a time. |
 
152
  | --gradient_accumulation_steps | 64 | int | The number of batches to pass through the network before updating the weights. |
153
  | --learning_rate | 5e-4 | float | The learning rate of the Adafactor optimizer. |
154
  | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
155
+ | --low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
156
+ | --max_gradient_norm | 1.0 | float | Clip gradients above this threshold norm before stepping. |
157
  | --rank | 8 | int | The rank of the LoRA decomposition matrices. |
158
  | --alpha | 1.0 | float | The strength of the LoRA signal. |
159
  | --dropout | 0.05 | float | The proportion of signals to send to zero during training as regularization. |
160
+ | --num_epochs | 3 | int | The number of epochs to train for. |
161
  | --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of needing to recompute the forward pass. |
162
  | --eval_interval | 1 | int | Evaluate the model after this many epochs on the testing set. |
163
+ | --eval_ratio | 0.1 | float | The proportion of testing samples to validate the model on. |
164
  | --checkpoint_interval | 1 | int | Save the model parameters to disk every this many epochs. |
165
+ | --checkpoint_path | "./checkpoints/instruct.pt" | string | The path to the LoRA checkpoint. |
166
  | --resume | False | bool | Should we resume training from the last checkpoint? |
167
  | --run_dir_path | "./runs/instruction-tune" | str | The path to the TensorBoard run directory for this training session. |
168
  | --device | "cuda" | string | The device to run the computation on. |
169
  | --seed | None | int | The seed for the random number generator. |
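
The `--max_gradient_norm` argument refers to gradient clipping by global norm. The pure-Python sketch below illustrates the idea on a flat list of gradient values; it is not the script's code, and it is only an assumption that the script relies on PyTorch's `torch.nn.utils.clip_grad_norm_` for this.

```python
import math


def clip_by_global_norm(gradients: list[float], max_norm: float) -> tuple[list[float], float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the threshold: leave the gradients untouched.
        return list(gradients), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in gradients], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)
```

The direction of the update is preserved; only its length is capped.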
170
 
171
+ ## Training Dashboard
172
+
173
+ We use [TensorBoard](https://www.tensorflow.org/tensorboard) to capture and display pretraining events such as loss and gradient norm updates. To launch the dashboard server run the following command from the terminal.
174
+
175
+ ```
176
+ tensorboard --logdir=./runs
177
+ ```
178
+
179
+ Then navigate to the dashboard using your favorite web browser.
180
+
181
  ## Text Generation
182
 
183
+ After pre-training, you can generate text from the model by running the `generate.py` script from the command line. This inference script samples tokens from the model one at a time conditioned on a prompt and any previously generated tokens, together referred to as the context window. In the example below, we sample only from the `top_k` most probable tokens, further restricted to the smallest set whose cumulative probability mass reaches `top_p` when ordered by descending probability.
184
 
185
  ```
186
+ python generate.py --temperature=0.4 --top_k=500 --top_p=0.9
187
  ```
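
The sampling procedure described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in `generate.py`: the function name `sample_token` is hypothetical, and the order of operations (temperature, then top-k, then top-p) is one common arrangement.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=500, top_p=0.9, rng=random):
    """Sample one token index using temperature, top-k, and top-p filtering."""
    # Temperature must be positive; values below 1.0 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]

    # Top-k: keep the indices of the k largest logits, in descending order.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probabilities = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_indices, kept_probabilities, cumulative = [], [], 0.0
    for index, probability in zip(ranked, probabilities):
        kept_indices.append(index)
        kept_probabilities.append(probability)
        cumulative += probability
        if cumulative >= top_p:
            break

    # random.choices renormalizes the remaining weights for us.
    return rng.choices(kept_indices, weights=kept_probabilities, k=1)[0]


token = sample_token([5.0, 1.0, 0.0, -1.0], temperature=0.4, top_k=500, top_p=0.9)
print(token)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, similar in spirit to the `--seed` argument.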
188
 
189
  ### Generation Arguments
 
191
  | Argument | Default | Type | Description |
192
  |---|---|---|---|
193
  | --checkpoint_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint file on disk. |
194
+ | --max_tokens | 2000 | int | The maximum number of tokens that the model should generate per sample. |
195
+ | --colorize_tokens | False | bool | Should we colorize the generated tokens based on the certainty of the model? |
196
  | --context_length | 1024 | int | The number of tokens to keep within the context window of the current prediction. |
197
  | --temperature | 1.0 | float | The amount of regularization applied to the candidate token probabilities. |
198
  | --top_k | 500 | int | Only sample from this many candidate tokens with the highest probabilities. |
199
  | --top_p | 0.9 | float | Of the `top_k` tokens, drop all but the `top_p` portion of the cumulative probability distribution. |
200
+ | --repeat_penalty | 0.1 | float | The proportion of the logit to penalize for previously generated tokens. |
201
+ | --repeat_window | 50 | int | The number of tokens to keep within the repeat window. |
202
  | --device | "cuda" | string | The device to run the computation on. |
203
  | --seed | None | int | The seed for the random number generator. |
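
The `--repeat_penalty` and `--repeat_window` descriptions leave the exact formula open. The sketch below shows one plausible reading, stated as an assumption: each token seen in the last `repeat_window` generated tokens has its logit lowered by `repeat_penalty` times the logit's magnitude. The function name is hypothetical and the script may implement the penalty differently.

```python
from collections import deque


def penalize_repeats(logits, recent_tokens, repeat_penalty=0.1):
    """Lower the logits of recently generated tokens by a proportion of their magnitude."""
    penalized = list(logits)
    for token in set(recent_tokens):
        # Using abs() ensures the penalty always pushes the logit down,
        # even when the logit is negative.
        penalized[token] -= repeat_penalty * abs(penalized[token])
    return penalized


# A bounded deque keeps only the most recent `repeat_window` tokens.
window = deque(maxlen=50)
window.extend([0, 1, 0])

print(penalize_repeats([2.0, -1.0, 0.5], window, repeat_penalty=0.1))
```

Tokens outside the window are left untouched, so the penalty fades once a token has not been generated for a while.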
204
 
205
+ ## Chatting
206
+
207
+ Once the model has been instruction-tuned, you can use the chat script to hold back-and-forth conversations with it. In addition, you can provide a custom system message that serves to focus activations for a particular tone or task. To start chatting, run the chat script as in the example below.
208
 
209
  ```
210
+ python chat.py
211
  ```
212
 
213
+ Since the chat script uses the same sampling technique as the `generate.py` script, you can use the same arguments to control the generation process, such as `--temperature` and `--top_k`.
214
+
215
+ ```
216
+ python chat.py --temperature=0.8 --top_k=300
217
+ ```
218
+
219
+ ### Chat Arguments
220
 
221
  | Argument | Default | Type | Description |
222
  |---|---|---|---|
223
  | --checkpoint_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint file on disk. |
224
  | --lora_path | None | string | The path to the LoRA checkpoint. |
225
+ | --max_tokens | 2000 | int | The maximum number of tokens that the model should generate per sample. |
226
+ | --colorize_tokens | False | bool | Should we colorize the generated tokens based on the certainty of the model? |
227
  | --context_length | 1024 | int | The number of tokens to keep within the context window of the current prediction. |
228
+ | --temperature | 0.7 | float | The amount of regularization applied to the candidate token probabilities. |
229
+ | --top_k | 500 | int | Only sample from this many candidate tokens with the highest probabilities. |
230
+ | --top_p | 0.9 | float | Of the `top_k` tokens, drop all but the `top_p` portion of the cumulative probability distribution. |
231
+ | --repeat_penalty | 0.1 | float | The proportion of the logit to penalize for previously generated tokens. |
232
+ | --repeat_window | 50 | int | The number of tokens to keep within the repeat window. |
233
  | --device | "cuda" | string | The device to run the computation on. |
234
  | --seed | None | int | The seed for the random number generator. |
235
 
 
245
  >- B. Zhang, et al. Root Mean Square Layer Normalization. 33rd Conference on Neural Information Processing Systems, NeurIPS 2019.
246
  >- J. Kaplan, et al. Scaling Laws for Neural Language Models, OpenAI, 2020.
247
  >- J. Hoffmann, et al. Training Compute-Optimal Large Language Models, DeepMind, 2022.