Update README.md
README.md CHANGED
```diff
@@ -1,6 +1,6 @@
 ---
 model-index:
-- name: tulu-v2.5-
+- name: llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm
   results: []
 datasets:
 - allenai/tulu-2.5-preference-data
@@ -14,7 +14,7 @@ license: apache-2.0
 <img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-2.5/tulu_25_banner.png" alt="Tulu 2.5 banner image" width="800px"/>
 </center>
 
-# Model Card for Tulu V2.5 PPO 13B - UltraFeedback Mean w. 8B UltraFeedback RM
+# Model Card for Llama 3 Tulu V2.5 PPO 8B - UltraFeedback Mean w. 8B UltraFeedback RM
 
 Tulu is a series of language models that are trained to act as helpful assistants.
 Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tulu 2 suite](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
@@ -22,13 +22,14 @@ This model is trained on the UltraFeedback dataset (using the per-aspect/fine-gr
 We used an 8B RM trained on the UltraFeedback dataset, and then used the UltraFeedback prompts during PPO training.
 
 This is part of a small update to the original V2.5 suite, adding some Llama 3-based models. We add three models:
-- [allenai/tulu-v2.5-
-- [allenai/tulu-v2.5-
-- [allenai/tulu-v2.5-
+- [allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm](https://huggingface.co/allenai/tulu-v2.5-llama3-8b-uf-mean-8b-uf-rm) (this model)
+- [allenai/llama-3-tulu-v2.5-8b-uf-mean-70b-uf-rm-mixed-prompts](https://huggingface.co/allenai/tulu-v2.5-llama3-8b-uf-mean-70b-uf-rm-mixed-prompts)
+- [allenai/llama-3-tulu-v2.5-8b-uf-mean-70b-uf-rm](https://huggingface.co/allenai/tulu-v2.5-llama3-8b-uf-mean-70b-uf-rm-mixed-prompts) (best overall model)
 
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
+Built with Meta Llama 3! Note that Llama 3 is released under the Meta Llama 3 community license, included here under llama_3_license.txt.
 
 ## Model description
 
```
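For reference, here is a minimal inference sketch that is not part of the card itself: it loads the model with Hugging Face `transformers` and generates one reply. The repo id below follows the `model-index` name added in this change; the card's own links use an `allenai/tulu-v2.5-llama3-...` path, so use whichever id actually resolves. It also assumes the repo ships a chat template, and the generation settings are illustrative, not recommendations from the authors.

```python
# Hypothetical usage sketch for the PPO-trained policy named in this card.
# Assumes the repo id below exists and provides a chat template; swap in the
# card's linked path (allenai/tulu-v2.5-llama3-8b-uf-mean-8b-uf-rm) if needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize PPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; sampling settings are a per-use-case choice, not a recommendation.
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```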