zinengtang committed on
Commit
4ed7218
·
1 Parent(s): e84a51e
README.md ADDED
@@ -0,0 +1,294 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - pytorch
6
+ - causal-lm
7
+ - pythia
8
+ license: apache-2.0
9
+ datasets:
10
+ - EleutherAI/pile
11
+ library_name: gpt-neox
12
+ ---
13
+
14
+ The *Pythia Scaling Suite* is a collection of models developed to facilitate
15
+ interpretability research [(see paper)](https://arxiv.org/pdf/2304.01373.pdf).
16
+ It contains two sets of eight models of sizes
17
+ 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
18
+ models: one trained on the Pile, and one trained on the Pile after the dataset
19
+ has been globally deduplicated. All 8 model sizes are trained on the exact
20
+ same data, in the exact same order. We also provide 154 intermediate
21
+ checkpoints per model, hosted on Hugging Face as branches.
22
+
23
+ The Pythia model suite was deliberately designed to promote scientific
24
+ research on large language models, especially interpretability research.
25
+ Despite not centering downstream performance as a design goal, we find the
26
+ models <a href="#evaluations">match or exceed</a> the performance of
27
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
28
+
29
+ <details>
30
+ <summary style="font-weight:600">Details on previous early release and naming convention.</summary>
31
+
32
+ Previously, we released an early version of the Pythia suite to the public.
33
+ However, we decided to retrain the model suite to address a few hyperparameter
34
+ discrepancies. This model card <a href="#changelog">lists the changes</a>;
35
+ see appendix B in the Pythia paper for further discussion. We found no
36
+ difference in benchmark performance between the two Pythia versions.
37
+ The old models are
38
+ [still available](https://huggingface.co/models?other=pythia_v0), but we
39
+ suggest the retrained suite if you are just starting to use Pythia.<br>
40
+ **This is the current release.**
41
+
42
+ Please note that all models in the *Pythia* suite were renamed in January
43
+ 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
44
+ comparing the old and new names</a> is provided in this model card, together
45
+ with exact parameter counts.
46
+ </details>
47
+ <br>
48
+
49
+ # Pythia-70M
50
+
51
+ ## Model Details
52
+
53
+ - Developed by: [EleutherAI](http://eleuther.ai)
54
+ - Model type: Transformer-based Language Model
55
+ - Language: English
56
+ - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
57
+ for training procedure, config files, and details on how to use.
58
+ [See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation
59
+ details.
60
+ - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
61
+ - License: Apache 2.0
62
+ - Contact: to ask questions about this model, join the [EleutherAI
63
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
64
+ Please read the existing *Pythia* documentation before asking about it in the
65
+ EleutherAI Discord. For general correspondence:
66
+ [contact@eleuther.ai](mailto:contact@eleuther.ai).
67
+
68
+ <figure>
69
+
70
+ | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
71
+ | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
72
+ | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
73
+ | 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
74
+ | 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
75
+ | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
76
+ | 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
77
+ | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
78
+ | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
79
+ | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
80
+ <figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
81
+ non-deduped models of a given size have the same hyperparameters. “Equivalent”
82
+ models have <b>exactly</b> the same architecture, and the same number of
83
+ non-embedding parameters.</figcaption>
84
+ </figure>
85
+
86
+ ## Uses and Limitations
87
+
88
+ ### Intended Use
89
+
90
+ The primary intended use of Pythia is research on the behavior, functionality,
91
+ and limitations of large language models. This suite is intended to provide
92
+ a controlled setting for performing scientific experiments. We also provide
93
+ 154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
94
+ `step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
95
+ `step143000`. These checkpoints are hosted on Hugging Face as branches. Note
96
+ that branch `143000` corresponds exactly to the model checkpoint on the `main`
97
+ branch of each model.
98
+
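The checkpoint layout described above can be enumerated programmatically. The following is a minimal sketch; the helper name `pythia_revisions` is hypothetical and not part of any Pythia or Transformers API. It only rebuilds the branch names from the step numbers stated in this section.

```python
def pythia_revisions():
    """Return the checkpoint branch names listed in this model card."""
    log_spaced = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
    evenly_spaced = range(1000, 143001, 1000)     # 1000, 2000, ..., 143000
    steps = [0, *log_spaced, *evenly_spaced]      # step0 comes first
    return [f"step{s}" for s in steps]


revisions = pythia_revisions()
print(len(revisions))                 # 1 + 10 + 143 names
print(revisions[:4], revisions[-1])
```

Each name is intended to be passed as the `revision` argument when loading a checkpoint, as in the Quickstart section of this card.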
99
+ You may also further fine-tune and adapt Pythia-70M for deployment,
100
+ as long as your use is in accordance with the Apache 2.0 license. Pythia
101
+ models work with the Hugging Face [Transformers
102
+ Library](https://huggingface.co/docs/transformers/index). If you decide to use
103
+ pre-trained Pythia-70M as a basis for your fine-tuned model, please
104
+ conduct your own risk and bias assessment.
105
+
106
+ ### Out-of-scope use
107
+
108
+ The Pythia Suite is **not** intended for deployment. It is not in itself
109
+ a product and cannot be used for human-facing interactions. For example,
110
+ the model may generate harmful or offensive text. Please evaluate the risks
111
+ associated with your particular use case.
112
+
113
+ Pythia models are English-language only, and are not suitable for translation
114
+ or generating text in other languages.
115
+
116
+ Pythia-70M has not been fine-tuned for downstream contexts in which
117
+ language models are commonly deployed, such as writing genre prose,
118
+ or commercial chatbots. This means Pythia-70M will **not**
119
+ respond to a given prompt the way a product like ChatGPT does. This is because,
120
+ unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
121
+ Learning from Human Feedback (RLHF) to better “follow” human instructions.
122
+
123
+ ### Limitations and biases
124
+
125
+ The core functionality of a large language model is to take a string of text
126
+ and predict the next token. The token the model selects need not produce the
127
+ most “accurate” text. Never rely on Pythia-70M to produce factually accurate
128
+ output.
129
+
130
+ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
131
+ known to contain profanity and texts that are lewd or otherwise offensive.
132
+ See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
133
+ discussion of documented biases with regard to gender, religion, and race.
134
+ Pythia-70M may produce socially unacceptable or undesirable text, *even if*
135
+ the prompt itself does not include anything explicitly offensive.
136
+
137
+ If you plan on using text generated through, for example, the Hosted Inference
138
+ API, we recommend having a human curate the outputs of this language model
139
+ before presenting them to other people. Please inform your audience that the
140
+ text was generated by Pythia-70M.
141
+
142
+ ### Quickstart
143
+
144
+ Pythia models can be loaded and used via the following code, demonstrated here
145
+ for the `step3000` checkpoint of `pythia-70m-deduped`:
146
+
+ ```python
+ from transformers import GPTNeoXForCausalLM, AutoTokenizer
+
+ model = GPTNeoXForCausalLM.from_pretrained(
+     "EleutherAI/pythia-70m-deduped",
+     revision="step3000",
+     cache_dir="./pythia-70m-deduped/step3000",
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "EleutherAI/pythia-70m-deduped",
+     revision="step3000",
+     cache_dir="./pythia-70m-deduped/step3000",
+ )
+
+ inputs = tokenizer("Hello, I am", return_tensors="pt")
+ tokens = model.generate(**inputs)
+ tokenizer.decode(tokens[0])
+ ```
166
+
167
+ Revision/branch `step143000` corresponds exactly to the model checkpoint on
168
+ the `main` branch of each model.<br>
169
+ For more information on how to use all Pythia models, see [documentation on
170
+ GitHub](https://github.com/EleutherAI/pythia).
171
+
172
+ ## Training
173
+
174
+ ### Training data
175
+
176
+ [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
177
+ English. It was created by EleutherAI specifically for training large language
178
+ models. It contains texts from 22 diverse sources, roughly broken down into
179
+ five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
180
+ prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
181
+ miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
182
+ paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
183
+ methodology, and a discussion of ethical implications. Consult [the
184
+ datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
185
+ about the Pile and its component datasets. The Pile can be downloaded from
186
+ the [official website](https://pile.eleuther.ai/), or from a [community
187
+ mirror](https://the-eye.eu/public/AI/pile/).<br>
188
+ The Pile was **not** deduplicated before being used to train Pythia-70M.
189
+
190
+ ### Training procedure
191
+
192
+ All models were trained on the exact same data, in the exact same order. Each
193
+ model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
194
+ model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
195
+ from `step1000` to `step143000` (which is the same as `main`). In addition, we
196
+ also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
197
+ This corresponds to training for just under 1 epoch on the Pile for
198
+ non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
199
+
200
+ All *Pythia* models trained for 143000 steps at a batch size
201
+ of 2M (2,097,152 tokens).<br>
202
+ See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
203
+ procedure, including [how to reproduce
204
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
205
+ Pythia uses the same tokenizer as
206
+ [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
207
+
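The token counts quoted in the training procedure follow directly from the step count and the batch size. The short check below is a sketch using only figures stated in this section; the constant and function names are illustrative.

```python
TOKENS_PER_STEP = 2_097_152      # batch size of "2M" tokens, i.e. 2**21
TOTAL_STEPS = 143_000            # training steps per model
CHECKPOINT_INTERVAL = 1_000      # steps between evenly spaced checkpoints


def tokens_seen(step: int) -> int:
    """Tokens consumed by the time a given training step is reached."""
    return step * TOKENS_PER_STEP


total_tokens = tokens_seen(TOTAL_STEPS)
tokens_between_checkpoints = tokens_seen(CHECKPOINT_INTERVAL)
num_even_checkpoints = TOTAL_STEPS // CHECKPOINT_INTERVAL

print(f"{total_tokens:,}")                # tokens seen over the full run
print(f"{tokens_between_checkpoints:,}")  # tokens between saved checkpoints
print(num_even_checkpoints)               # evenly spaced checkpoints
```

The same `tokens_seen` helper gives the amount of data behind any intermediate branch, for example `tokens_seen(3000)` for `step3000`.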
208
+ ## Evaluations
209
+
210
+ All 16 *Pythia* models were evaluated using the [LM Evaluation
211
+ Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
212
+ the results by model and step at `results/json/*` in the [GitHub
213
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
214
+ Expand the sections below to see plots of evaluation results for all
215
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
216
+
217
+ <details>
218
+ <summary>LAMBADA – OpenAI</summary>
219
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
220
+ </details>
221
+
222
+ <details>
223
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
224
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
225
+ </details>
226
+
227
+ <details>
228
+ <summary>WinoGrande</summary>
229
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
230
+ </details>
231
+
232
+ <details>
233
+ <summary>AI2 Reasoning Challenge—Easy Set</summary>
234
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
235
+ </details>
236
+
237
+ <details>
238
+ <summary>SciQ</summary>
239
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
240
+ </details>
241
+
242
+ ## Changelog
243
+
244
+ This section compares differences between previously released
245
+ [Pythia v0](https://huggingface.co/models?other=pythia_v0) and the current
246
+ models. See Appendix B of the Pythia paper for further discussion of these
247
+ changes and the motivation behind them. We found that retraining Pythia had no
248
+ impact on benchmark performance.
249
+
250
+ - All model sizes are now trained with uniform batch size of 2M tokens.
251
+ Previously, the models of size 160M, 410M, and 1.4B parameters were trained
252
+ with batch sizes of 4M tokens.
253
+ - We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
254
+ 128,256,512} in addition to every 1000 training steps.
255
+ - Flash Attention was used in the new retrained suite.
256
+ - We remedied a minor inconsistency that existed in the original suite: all
257
+ models of size 2.8B parameters or smaller had a learning rate (LR) schedule
258
+ which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
259
+ 12B models all used an LR schedule which decayed to a minimum LR of 0. In
260
+ the redone training runs, we rectified this inconsistency: all models are now
261
+ trained with LR decaying to a minimum of 0.1× their maximum LR.
262
+
263
+ ### Naming convention and parameter count
264
+
265
+ *Pythia* models were renamed in January 2023. It is possible that the old
266
+ naming convention still persists in some documentation by accident. The
267
+ current naming convention (70M, 160M, etc.) is based on total parameter count.
268
+
269
+ <figure style="width:32em">
270
+
271
+ | current Pythia suffix | old suffix | total params | non-embedding params |
272
+ | --------------------: | ---------: | -------------: | -------------------: |
273
+ | 70M | 19M | 70,426,624 | 18,915,328 |
274
+ | 160M | 125M | 162,322,944 | 85,056,000 |
275
+ | 410M | 350M | 405,334,016 | 302,311,424 |
276
+ | 1B | 800M | 1,011,781,632 | 805,736,448 |
277
+ | 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
278
+ | 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
279
+ | 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
280
+ | 12B | 13B | 11,846,072,320 | 11,327,027,200 |
281
+ </figure>
282
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
283
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_EleutherAI__pythia-70m)
284
+
285
+ | Metric | Value |
286
+ |-----------------------|---------------------------|
287
+ | Avg. | 25.28 |
288
+ | ARC (25-shot) | 21.59 |
289
+ | HellaSwag (10-shot) | 27.29 |
290
+ | MMLU (5-shot) | 25.9 |
291
+ | TruthfulQA (0-shot) | 47.06 |
292
+ | Winogrande (5-shot) | 51.46 |
293
+ | GSM8K (5-shot) | 0.3 |
294
+ | DROP (3-shot) | 3.33 |
config.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "architectures": [
3
+ "GPTNeoXForCausalLM"
4
+ ],
5
+ "bos_token_id": 0,
6
+ "eos_token_id": 0,
7
+ "hidden_act": "gelu",
8
+ "hidden_size": 512,
9
+ "initializer_range": 0.02,
10
+ "intermediate_size": 2048,
11
+ "layer_norm_eps": 1e-05,
12
+ "max_position_embeddings": 2048,
13
+ "model_type": "gpt_neox",
14
+ "num_attention_heads": 8,
15
+ "num_hidden_layers": 6,
16
+ "rotary_emb_base": 10000,
17
+ "rotary_pct": 0.25,
18
+ "tie_word_embeddings": false,
19
+ "torch_dtype": "float16",
20
+ "transformers_version": "4.24.0",
21
+ "use_cache": true,
22
+ "use_parallel_residual": true,
23
+ "vocab_size": 50304
24
+ }
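The parameter counts in the README tables can be reproduced from this config. The sketch below assumes a standard GPT-NeoX block layout that the config does not spell out: a fused QKV projection and an output projection with biases, a two-layer MLP with biases, two layer norms per block, one final layer norm, and separate bias-free input and output embeddings (consistent with `"tie_word_embeddings": false`). The function name `count_params` is hypothetical.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 512,
    "intermediate_size": 2048,
    "num_hidden_layers": 6,
    "vocab_size": 50304,
}


def count_params(cfg):
    """Return (non_embedding, total) parameter counts under the assumed layout."""
    d, ff = cfg["hidden_size"], cfg["intermediate_size"]
    layers, vocab = cfg["num_hidden_layers"], cfg["vocab_size"]

    attention = (d * 3 * d + 3 * d) + (d * d + d)  # fused QKV + output projection
    mlp = (d * ff + ff) + (ff * d + d)             # up and down projections
    layer_norms = 2 * (2 * d)                      # weight + bias, two norms per block
    final_norm = 2 * d                             # assumed final layer norm

    non_embedding = layers * (attention + mlp + layer_norms) + final_norm
    embedding = 2 * vocab * d                      # assumed untied input/output embeddings
    return non_embedding, non_embedding + embedding


non_embedding, total = count_params(config)
print(f"{non_embedding:,} non-embedding, {total:,} total")
```

Under these assumptions the result agrees with the 70M row of the naming-convention table (18,915,328 non-embedding and 70,426,624 total parameters).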
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ebfa4e2f18696ebd83716a0d39fe2c025f2ff8483f72a83ca59c475692fc9d15
3
+ size 166029852
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f629437d6fe950ffda3ef94f6a956a1bb63a2e79e03c296c92fe208999aeb092
3
+ size 166049099
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "unk_token": "<|endoftext|>"
5
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": "<|endoftext|>",
4
+ "eos_token": "<|endoftext|>",
5
+ "name_or_path": "EleutherAI/gpt-neox-20b",
6
+ "special_tokens_map_file": "/admin/home-hailey/.cache/huggingface/hub/models--EleutherAI--gpt-neox-20b/snapshots/4e49eadb5d14bd22f314ec3f45b69a87b88c7691/special_tokens_map.json",
7
+ "tokenizer_class": "GPTNeoXTokenizer",
8
+ "unk_token": "<|endoftext|>"
9
+ }