---
base_model: meta-llama/Llama-3.1-8B-Instruct
language:
- multilingual
datasets:
- cognitivecomputations/dolphin-r1
- openai/gsm8k
library_name: transformers
license: llama3.1
license_link: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
pipeline_tag: text-generation
tags:
- nlp
- code
quantized_by: ymcki
widget:
- messages:
  - role: user
    content: Can you provide ways to eat combinations of bananas and dragonfruits?
---

Original model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
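
With transformers, the chat template bundled with the tokenizer reproduces this format automatically. A minimal check (nothing beyond the base tokenizer is assumed; `{system_prompt}` and `{prompt}` are placeholders):

```py
from transformers import AutoTokenizer

# Render the chat template without tokenizing to inspect the raw prompt string
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
chat = [
    {"role": "system", "content": "{system_prompt}"},
    {"role": "user", "content": "{prompt}"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True))
```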

Following the same procedure as DeepSeek R1, [SFT](https://techcommunity.microsoft.com/blog/machinelearningblog/distillation-of-phi-4-on-deepseek-r1-sft-and-grpo/4381697) was performed first with Cognitive Computations' dolphin-r1 dataset, followed by Group Relative Policy Optimization (GRPO) with OpenAI's gsm8k dataset. The two resulting adapters were applied to Llama-3.1-8B-Instruct to see whether reasoning and math performance could be further improved.
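
The training scripts are not included in this card; the sketch below shows what the SFT stage could look like using TRL's `SFTTrainer` with a LoRA adapter on dolphin-r1. The library choice, dataset config name, column name, and hyperparameters are illustrative assumptions, not the author's exact setup.

```py
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# dolphin-r1 is a conversational dataset; config and column names are assumed here
dataset = load_dataset("cognitivecomputations/dolphin-r1", "reasoning-deepseek", split="train")
dataset = dataset.select_columns(["messages"])  # keep only the chat turns

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-dolphin-r1", num_train_epochs=1),
    # Train a LoRA adapter rather than full weights, so it can later be applied to the base model
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```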

The GRPO run lasted one epoch. The highest average reward over the last 53 steps was recorded at epoch 0.96. The adapter was then applied to Llama-3.1-8B-Instruct.

| Epoch | reward/format | reward/correct | reward/total |
| ----- | ------------- | -------------- | ------------ |
| 0.52  | 0.469783      | 1.27358        | 1.74337      |
| 0.96  | 0.750012      | 1.10613        | 1.85614      |
| 1.00  | 0.747508      | 1.05425        | 1.80175      |
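
The reward/format and reward/correct columns above suggest two reward signals, one for output format and one for answer correctness. Below is a minimal sketch of such a setup with TRL's `GRPOTrainer` on gsm8k; the reward definitions, tags, and magnitudes are illustrative assumptions, not the exact functions used to train this model.

```py
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# gsm8k questions become prompts; the reference answers stay available to the reward function
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def format_reward(completions, **kwargs):
    # Reward completions that wrap their final answer in <answer> tags (tag format assumed)
    return [1.0 if re.search(r"<answer>.*</answer>", c, re.DOTALL) else 0.0 for c in completions]

def correct_reward(completions, answer, **kwargs):
    # Reward completions whose last number matches the gsm8k reference after "####"
    rewards = []
    for completion, ref in zip(completions, answer):
        gold = ref.split("####")[-1].strip()
        nums = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(2.0 if nums and nums[-1] == gold else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or the SFT checkpoint from the dolphin-r1 stage
    reward_funcs=[format_reward, correct_reward],
    args=GRPOConfig(output_dir="grpo-gsm8k", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```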

This model is uploaded here to be evaluated by the Open LLM Leaderboard. Further GRPO fine-tuning is underway to see whether additional improvement is possible.
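
Applying a LoRA adapter back onto the base model can be done with PEFT; a minimal sketch, where "grpo-adapter" is a placeholder path for the trained adapter:

```py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
# Load the adapter on top of the base model, then merge it into the weights
merged = PeftModel.from_pretrained(base, "grpo-adapter").merge_and_unload()
merged.save_pretrained("Llama-3.1-8B-SFT-GRPO-Instruct")
```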

## Benchmark (100.0 * raw scores only)

Click on a model name to go to the raw score JSON generated by the Open LLM Leaderboard.

| Model | Average | IFEval | BBH | Math Lv5 | GPQA | MUSR | MMLU-PRO |
| ----- | ------- | ------ | --- | -------- | ---- | ---- | -------- |
| [Llama-3.1-8B-Instruct](https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/meta-llama/Meta-Llama-3.1-8B-Instruct/results_2024-10-24T00-00-00.000000.json) | 42.24 | 80.48 | 50.62 | 19.34 | 26.76 | 38.62 | 37.62 |
| [Llama-3.1-8B-GRPO-Instruct](https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/ymcki/Llama-3.1-8B-GRPO-Instruct/results_2025-02-24T17-37-02.760485.json) | 42.00 | 75.61 | 51.21 | 20.24 | 29.45 | 38.10 | 37.38 |
| Llama-3.1-8B-SFT-GRPO-Instruct | | | | | | | |

## How to run this model

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hub repo id; a locally downloaded copy of the model works here as well
model_id = "ymcki/Llama-3.1-8B-SFT-GRPO-Instruct"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    {"role": "user", "content": "Write a hello world program"},
]
# Build the Llama 3.1 chat prompt, then generate and print the assistant's reply
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# add_special_tokens=False because the template already starts with <|begin_of_text|>
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific files you want:

```
huggingface-cli download ymcki/Llama-3.1-8B-SFT-GRPO-Instruct --include "*" --local-dir ./
```
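
The same download can also be done from Python with `huggingface_hub`; a brief equivalent of the command above:

```py
from huggingface_hub import snapshot_download

# Download the full repository into the current directory
snapshot_download(repo_id="ymcki/Llama-3.1-8B-SFT-GRPO-Instruct", local_dir="./")
```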

## Credits

Thanks to DeepSeek for developing the original GRPO method.