roadz committed
Commit 3052732 · verified · 1 Parent(s): 6d1fe3c

Delete README.md

Files changed (1)
  1. README.md +0 -331
README.md DELETED
@@ -1,331 +0,0 @@
---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
base_model:
- allenai/Llama-3.1-Tulu-3-8B-DPO
library_name: transformers
model-index:
- name: Llama-3.1-Tulu-3-8B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 82.55
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 16.86
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 18.88
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 6.26
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 10.52
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 20.23
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=allenai%2FLlama-3.1-Tulu-3-8B
      name: Open LLM Leaderboard
---

<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu3/Tulu3-logo.png" alt="Tulu 3 banner" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

# Llama-3.1-Tulu-3-8B

Tülu 3 is a leading instruction-following model family, offering fully open-source data, code, and recipes designed to serve as a comprehensive guide for modern post-training techniques.
Tülu 3 is designed for state-of-the-art performance on a diverse range of tasks in addition to chat, such as MATH, GSM8K, and IFEval.

## Model description

- **Model type:** A model trained on a mix of publicly available, synthetic and human-created datasets.
- **Language(s) (NLP):** Primarily English
- **License:** Llama 3.1 Community License Agreement
- **Finetuned from model:** allenai/Llama-3.1-Tulu-3-8B-DPO

### Model Sources

- **Training Repository:** https://github.com/allenai/open-instruct
- **Eval Repository:** https://github.com/allenai/olmes
- **Paper:** https://arxiv.org/abs/2411.15124
- **Demo:** https://playground.allenai.org/

### Model Family

| **Stage** | **Llama 3.1 8B** | **Llama 3.1 70B** |
|-----------|------------------|-------------------|
| **Base Model** | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) |
| **SFT** | [allenai/Llama-3.1-Tulu-3-8B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) | [allenai/Llama-3.1-Tulu-3-70B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-SFT) |
| **DPO** | [allenai/Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) | [allenai/Llama-3.1-Tulu-3-70B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-DPO) |
| **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
| **Reward Model (RM)** | [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |

| **Stage** | **Llama 3.1 405B** |
|-----------|--------------------|
| **Base Model** | [meta-llama/llama-3.1-405B](https://huggingface.co/meta-llama/llama-3.1-405B) |
| **SFT** | [allenai/llama-3.1-Tulu-3-405B-SFT](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-SFT) |
| **DPO** | [allenai/llama-3.1-Tulu-3-405B-DPO](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-DPO) |
| **Final Model (RLVR)** | [allenai/llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B) |
| **Reward Model (RM)** | (Same as 8B) |

## Using the model

### Loading with HuggingFace

To load the model with HuggingFace, use the following snippet:
```python
from transformers import AutoModelForCausalLM

tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")
```

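For a quick end-to-end generation check, a minimal sketch along these lines should work; the `torch_dtype`, device placement, and `max_new_tokens` values below are illustrative choices, not settings from the Tülu 3 recipe:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a prompt with the model's built-in chat template.
messages = [{"role": "user", "content": "How are you doing?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
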
### vLLM

As a Llama base model, the model can be easily served with:
```bash
vllm serve allenai/Llama-3.1-Tulu-3-8B
```
Note that given the long chat template of Llama, you may want to use `--max_model_len=8192`.

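Once the server is running, it exposes an OpenAI-compatible API. A minimal client sketch, assuming the default local endpoint (`http://localhost:8000/v1`) and the `openai` Python package:
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is unused locally but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="allenai/Llama-3.1-Tulu-3-8B",
    messages=[{"role": "user", "content": "How are you doing?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
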
### Chat template

The chat template for our models is formatted as:
```
<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
Or with new lines expanded:
```
<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
It is embedded within the tokenizer as well, for `tokenizer.apply_chat_template`.

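To inspect the rendered prompt directly, you can ask the tokenizer to apply the template without tokenizing; a small sketch (the example message is illustrative):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")

messages = [{"role": "user", "content": "How are you doing?"}]
# tokenize=False returns the formatted prompt string instead of token ids;
# add_generation_prompt=True appends the assistant header so the model continues with a reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```
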
### System prompt

In Ai2 demos, we use this system prompt by default:
```
You are Tulu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI.
```
The model has not been trained with a specific system prompt in mind.

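If you want to mirror the demo behavior, the system prompt can be passed as the first message, assuming the tokenizer's chat template accepts a `system` role (the Ai2 demos use one); a minimal sketch:
```python
# Illustrative message list; render it with the chat template as shown above.
messages = [
    {"role": "system", "content": "You are Tulu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI."},
    {"role": "user", "content": "How are you doing?"},
]
```
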
### Bias, Risks, and Limitations

The Tülu 3 models have limited safety training and are not deployed with in-the-loop filtering of responses the way ChatGPT is, so the model can produce problematic outputs (especially when prompted to do so).
It is also unknown what the size and composition of the corpus used to train the base Llama 3.1 models was, though it likely included a mix of web data and technical sources like books and code.
See the Falcon 180B model card for an example of this.


## Performance

| Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
|------------------|---------------|---------------|-----------|-----------------------|----------------------|-----------|---------------------|-----------------------|
| **Avg.** | 60.4 | 64.4 | **64.8** | 62.2 | 57.8 | 44.7 | 55.2 | 58.3 |
| **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
| **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
| **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
| **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 62.8 | 21.7 | 0.9 | 2.5 | 56.2 |
| **DROP (3 shot)** | 61.3 | 62.5 | **62.6** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
| **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | **43.7** | 42.5 | 14.8 | 5.1 | 29.8 | 40.0 |
| **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | **87.6** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
| **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
| **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
| **IFEval (prompt loose)** | 72.8 | 81.1 | **82.4** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
| **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
| **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |

| Benchmark (eval) | Tülu 3 70B SFT | Tülu 3 DPO 70B | Tülu 3 70B | Llama 3.1 70B Instruct | Qwen 2.5 72B Instruct | Hermes 3 Llama 3.1 70B | Nemotron Llama 3.1 70B |
|------------------|----------------|----------------|------------|------------------------|-----------------------|------------------------|------------------------|
| **Avg.** | 72.6 | 75.9 | **76.0** | 73.4 | 71.5 | 68.3 | 65.5 |
| **MMLU (0 shot, CoT)** | 78.9 | 83.3 | 83.1 | 85.3 | **85.5** | 80.4 | 83.8 |
| **PopQA (15 shot)** | **48.6** | 46.3 | 46.5 | 46.4 | 30.6 | 48.1 | 36.4 |
| **TruthfulQA (6 shot)** | 55.7 | 67.9 | 67.6 | 66.8 | **69.9** | 66.5 | 62.6 |
| **BigBenchHard (3 shot, CoT)** | **82.7** | 81.8 | 82.0 | 73.8 | 67.2 | 82.1 | 0.7 |
| **DROP (3 shot)** | **77.2** | 74.1 | 74.3 | 77.0 | 34.2 | 73.2 | 68.8 |
| **MATH (4 shot CoT, Flex)** | 53.7 | 62.3 | 63.0 | 56.4 | **74.3** | 41.9 | 55.0 |
| **GSM8K (8 shot, CoT)** | 91.1 | 93.5 | 93.5 | **93.7** | 89.5 | 90.0 | 84.7 |
| **HumanEval (pass@10)** | 92.9 | 92.4 | 92.4 | 93.6 | 94.0 | 89.6 | **94.1** |
| **HumanEval+ (pass@10)** | 87.3 | 88.4 | 88.0 | 89.5 | **90.8** | 85.9 | 85.5 |
| **IFEval (prompt loose)** | 82.1 | 82.6 | 83.2 | **88.0** | 87.6 | 76.0 | 79.9 |
| **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
| **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |

| Benchmark (eval) | Tülu 3 405B SFT | Tülu 3 405B DPO | Tülu 3 405B | Llama 3.1 405B Instruct | Nous Hermes 3 405B | Deepseek V3 | GPT 4o (11-24) |
|------------------|-----------------|-----------------|-------------|-------------------------|--------------------|-------------|----------------|
| **Avg w/o Safety** | 76.3 | 79.0 | 80.0 | 78.1 | 74.4 | 79.0 | **80.5** |
| **Avg w/ Safety** | 77.5 | 79.6 | 80.7 | 79.0 | 73.5 | 75.9 | **81.6** |
| **MMLU (5 shot, CoT)** | 84.4 | 86.6 | 87.0 | **88.0** | 84.9 | 82.1 | 87.9 |
| **PopQA (3 shot)** | **55.7** | 55.4 | 55.5 | 52.9 | 54.2 | 44.9 | 53.6 |
| **BigBenchHard (0 shot, CoT)** | 88.0 | 88.8 | 88.6 | 87.1 | 87.7 | **89.5** | 83.3 |
| **MATH (4 shot, Flex)** | 63.4 | 59.9 | 67.3 | 66.6 | 58.4 | **72.5** | 68.8 |
| **GSM8K (8 shot, CoT)** | 93.6 | 94.2 | **95.5** | 95.4 | 92.7 | 94.1 | 91.7 |
| **HumanEval (pass@10)** | 95.7 | **97.2** | 95.9 | 95.9 | 92.3 | 94.6 | 97.0 |
| **HumanEval+ (pass@10)** | 93.3 | **93.9** | 92.9 | 90.3 | 86.9 | 91.6 | 92.7 |
| **IFEval (prompt loose)** | 82.4 | 85.0 | 86.0 | **88.4** | 81.9 | 88.0 | 84.8 |
| **AlpacaEval 2 (LC % win)** | 30.4 | 49.8 | 51.4 | 38.5 | 30.2 | 53.5 | **65.0** |
| **Safety (6 task avg.)** | 87.7 | 85.5 | 86.7 | 86.8 | 65.8 | 72.2 | **90.9** |

## Hyperparameters

PPO settings for RLVR (the same values are collected in one place in the sketch after this list):
- **Learning Rate**: 3 × 10⁻⁷
- **Discount Factor (gamma)**: 1.0
- **Generalized Advantage Estimation (lambda)**: 0.95
- **Mini-batches (N_mb)**: 1
- **PPO Update Iterations (K)**: 4
- **PPO's Clipping Coefficient (epsilon)**: 0.2
- **Value Function Coefficient (c1)**: 0.1
- **Gradient Norm Threshold**: 1.0
- **Learning Rate Schedule**: Linear
- **Generation Temperature**: 1.0
- **Batch Size (effective)**: 224
- **Max Token Length**: 2,048
- **Max Prompt Token Length**: 2,048
- **Penalty Reward Value for Responses without an EOS Token**: -10.0
- **Response Length**: 2,048
- **Total Episodes**: 100,000
- **KL penalty coefficient (beta)**: 0.05
- **Warm up ratio (omega)**: 0.0

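For convenience, the same RLVR/PPO settings are gathered below as a plain Python dictionary. This is only an illustrative summary of the values listed above; the keys are descriptive names chosen for this sketch, not the actual open-instruct configuration options:
```python
# Illustrative summary of the RLVR (PPO) settings listed above; not open-instruct arguments.
rlvr_ppo_settings = {
    "learning_rate": 3e-7,
    "lr_schedule": "linear",
    "warmup_ratio": 0.0,
    "discount_factor_gamma": 1.0,
    "gae_lambda": 0.95,
    "num_mini_batches": 1,
    "ppo_update_iterations": 4,
    "ppo_clip_epsilon": 0.2,
    "value_function_coef": 0.1,
    "gradient_norm_threshold": 1.0,
    "generation_temperature": 1.0,
    "effective_batch_size": 224,
    "max_token_length": 2048,
    "max_prompt_token_length": 2048,
    "response_length": 2048,
    "missing_eos_penalty": -10.0,
    "total_episodes": 100_000,
    "kl_penalty_coef_beta": 0.05,
}
```
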
## License and use

All Llama 3.1 Tülu 3 models are released under Meta's [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/).
Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc.
Tülu 3 is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).

The models have been fine-tuned using a dataset mix with outputs generated from third-party models and are subject to additional terms:
[Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Qwen License Agreement](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) (models were improved using Qwen 2.5).


## Citation

If Tülu 3 or any of the related materials were helpful to your work, please cite:
```
@article{lambert2024tulu3,
  title  = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and
    Jacob Morrison and
    Valentina Pyatkin and
    Shengyi Huang and
    Hamish Ivison and
    Faeze Brahman and
    Lester James V. Miranda and
    Alisa Liu and
    Nouha Dziri and
    Shane Lyu and
    Yuling Gu and
    Saumya Malik and
    Victoria Graf and
    Jena D. Hwang and
    Jiangjiang Yang and
    Ronan Le Bras and
    Oyvind Tafjord and
    Chris Wilhelm and
    Luca Soldaini and
    Noah A. Smith and
    Yizhong Wang and
    Pradeep Dasigi and
    Hannaneh Hajishirzi
  },
  year  = {2024},
  email = {[email protected]}
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/allenai__Llama-3.1-Tulu-3-8B-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=allenai%2FLlama-3.1-Tulu-3-8B&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!

| Metric              | Value (%) |
|---------------------|----------:|
| **Average**         |     25.88 |
| IFEval (0-Shot)     |     82.55 |
| BBH (3-Shot)        |     16.86 |
| MATH Lvl 5 (4-Shot) |     18.88 |
| GPQA (0-shot)       |      6.26 |
| MuSR (0-shot)       |     10.52 |
| MMLU-PRO (5-shot)   |     20.23 |