---
base_model: meta-llama/Llama-3.1-8B-Instruct
language:
- multilingual
datasets:
- cognitivecomputations/dolphin-r1
- openai/gsm8k
library_name: transformers
license: llama3.1
license_link: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
pipeline_tag: text-generation
tags:
- nlp
- code
quantized_by: ymcki
widget:
- messages:
  - role: user
    content: Can you provide ways to eat combinations of bananas and dragonfruits?
---

Original model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 July 2024
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

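The placeholders above can be filled by plain string formatting; a minimal sketch (the `render_prompt` helper is hypothetical, and the dates are the fixed strings from the template above):

```python
# Hypothetical helper: render the Llama 3.1 prompt format shown above
# with plain string formatting (no tokenizer required).
def render_prompt(system_prompt: str, prompt: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        "Cutting Knowledge Date: December 2023\n"
        "Today Date: 26 July 2024\n"
        f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )

text = render_prompt("You are a helpful assistant.", "Hello!")
```

In practice `tokenizer.apply_chat_template` produces this format automatically, as in the run example below.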
Following the same procedure as DeepSeek R1, [SFT](https://techcommunity.microsoft.com/blog/machinelearningblog/distillation-of-phi-4-on-deepseek-r1-sft-and-grpo/4381697) was performed first with Cognitive Computations' dolphin-r1 dataset, followed by Group Relative Policy Optimization (GRPO) with OpenAI's gsm8k dataset. The two resulting adapters were applied to Llama-3.1-8B-Instruct to see if reasoning and math can be further improved.

The GRPO run lasted one epoch. The highest average reward over the last 53 steps was recorded at epoch 0.96, so the adapter from that checkpoint was applied to Llama-3.1-8B-Instruct.

| Epoch | reward/format | reward/correct | reward/total |
| ----- | ------------- | -------------- | ------------ |
| 0.52 | 0.469783 | 1.27358 | 1.74337 |
| 0.96 | 0.750012 | 1.10613 | 1.85614 |
| 1.00 | 0.747508 | 1.05425 | 1.80175 |

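The `reward/format` and `reward/correct` columns track the two usual GRPO reward components for gsm8k-style training: one for emitting the expected output structure, one for the final answer. A minimal sketch of such rewards (these exact functions and tag names are illustrative assumptions, not the rewards used for this run):

```python
import re

# Illustrative GRPO reward components for gsm8k-style training.
# Assumed completion format: reasoning in <think>...</think>,
# final answer in <answer>...</answer>.

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected tag format, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """2.0 if the extracted final answer matches the gold answer, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 2.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

completion = "<think>2 + 2 = 4</think><answer>4</answer>"
total = format_reward(completion) + correctness_reward(completion, "4")
```

The total reward per completion is the sum of the components, which is what the `reward/total` column reports.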
This model is uploaded here to be evaluated by the Open LLM Leaderboard. Further GRPO fine-tuning is currently underway to see if further improvement is possible.

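For reference, GRPO replaces PPO's value model with a group-relative baseline: each sampled completion's advantage is its reward normalized against the rewards of the other completions in its group. A minimal sketch of that computation (illustrative, not the training code used for this model):

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its group's mean and std (GRPO baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All completions scored the same: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Rewards for a group of sampled completions of a single prompt
advantages = group_advantages([1.0, 2.0, 3.0])
```

Completions scoring above the group mean get a positive advantage and are reinforced; those below the mean are penalized.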
## Benchmark (100.0*raw scores only)

Click on a model name to go to the raw score JSON generated by the Open LLM Leaderboard.

| Model | Average | IFEval | BBH | Math Lv5 | GPQA | MUSR | MMLU-PRO |
| ----- | ------- | ------ | --- | -------- | ---- | ---- | -------- |
| [Llama-3.1-8B-Instruct](https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/meta-llama/Meta-Llama-3.1-8B-Instruct/results_2024-10-24T00-00-00.000000.json) | 42.24 | 80.48 | 50.62 | 19.34 | 26.76 | 38.62 | 37.62 |
| [Llama-3.1-8B-GRPO-Instruct](https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/ymcki/Llama-3.1-8B-GRPO-Instruct/results_2025-02-24T17-37-02.760485.json) | 42.00 | 75.61 | 51.21 | 20.24 | 29.45 | 38.10 | 37.38 |
| Llama-3.1-8B-SFT-GRPO-Instruct | | | | | | | |

The gain in reasoning and math is offset by a loss in instruction following.

## How to run this model

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ymcki/Llama-3.1-8B-SFT-GRPO-Instruct"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    {"role": "user", "content": "Write a hello world program"},
]
# Render the chat template, generate, and decode only the new tokens
inputs = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can download the whole repository to a local directory:

```
huggingface-cli download ymcki/Llama-3.1-8B-SFT-GRPO-Instruct --include "*" --local-dir ./
```

## Credits

Thanks to DeepSeek for developing the original GRPO method.