---
base_model: meta-llama/Llama-3.1-8B-Instruct
language:
- multilingual
datasets:
- cognitivecomputations/dolphin-r1
- openai/gsm8k
library_name: transformers
license: llama3.1
license_link: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
pipeline_tag: text-generation
tags:
- nlp
- code
quantized_by: ymcki
widget:
- messages:
  - role: user
    content: Can you provide ways to eat combinations of bananas and dragonfruits?
---

Original model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
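
With transformers, the chat template bundled with the tokenizer reproduces this format automatically. A minimal check (nothing beyond the base tokenizer is assumed; `{system_prompt}` and `{prompt}` are placeholders):

```py
from transformers import AutoTokenizer

# Render the chat template without tokenizing to inspect the raw prompt string
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
chat = [
    {"role": "system", "content": "{system_prompt}"},
    {"role": "user", "content": "{prompt}"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True))
```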

Following the same procedure as DeepSeek R1, [SFT](https://techcommunity.microsoft.com/blog/machinelearningblog/distillation-of-phi-4-on-deepseek-r1-sft-and-grpo/4381697) was performed first with Cognitive Computations' dolphin-r1 dataset, followed by Group Relative Policy Optimization (GRPO) with OpenAI's gsm8k dataset. The two resulting adapters were applied to Llama-3.1-8B-Instruct to see whether reasoning and math performance could be further improved.
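
The training scripts are not included in this card; the sketch below shows what the SFT stage could look like using TRL's `SFTTrainer` with a LoRA adapter on dolphin-r1. The library choice, dataset config name, column name, and hyperparameters are illustrative assumptions, not the author's exact setup.

```py
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# dolphin-r1 is a conversational dataset; config and column names are assumed here
dataset = load_dataset("cognitivecomputations/dolphin-r1", "reasoning-deepseek", split="train")
dataset = dataset.select_columns(["messages"])  # keep only the chat turns

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-dolphin-r1", num_train_epochs=1),
    # Train a LoRA adapter rather than full weights, so it can later be applied to the base model
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```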

The GRPO run lasted one epoch. The highest average reward over the last 53 steps was recorded at epoch 0.96. The adapter was then applied to Llama-3.1-8B-Instruct.

| Epoch | reward/format | reward/correct | reward/total |
| ----- | ------------- | -------------- | ------------ |
| 0.52  | 0.469783      | 1.27358        | 1.74337      |
| 0.96  | 0.750012      | 1.10613        | 1.85614      |
| 1.00  | 0.747508      | 1.05425        | 1.80175      |
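
The reward/format and reward/correct columns above suggest two reward signals, one for output format and one for answer correctness. Below is a minimal sketch of such a setup with TRL's `GRPOTrainer` on gsm8k; the reward definitions, tags, and magnitudes are illustrative assumptions, not the exact functions used to train this model.

```py
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# gsm8k questions become prompts; the reference answers stay available to the reward function
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def format_reward(completions, **kwargs):
    # Reward completions that wrap their final answer in <answer> tags (tag format assumed)
    return [1.0 if re.search(r"<answer>.*</answer>", c, re.DOTALL) else 0.0 for c in completions]

def correct_reward(completions, answer, **kwargs):
    # Reward completions whose last number matches the gsm8k reference after "####"
    rewards = []
    for completion, ref in zip(completions, answer):
        gold = ref.split("####")[-1].strip()
        nums = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(2.0 if nums and nums[-1] == gold else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or the SFT checkpoint from the dolphin-r1 stage
    reward_funcs=[format_reward, correct_reward],
    args=GRPOConfig(output_dir="grpo-gsm8k", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```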

This model is uploaded here to be evaluated by the Open LLM Leaderboard. Further GRPO fine-tuning is underway to see whether additional improvement is possible.
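
Applying a LoRA adapter back onto the base model can be done with PEFT; a minimal sketch, where "grpo-adapter" is a placeholder path for the trained adapter:

```py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
# Load the adapter on top of the base model, then merge it into the weights
merged = PeftModel.from_pretrained(base, "grpo-adapter").merge_and_unload()
merged.save_pretrained("Llama-3.1-8B-SFT-GRPO-Instruct")
```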

## Benchmark (100.0 * raw scores only)

Click on a model name to go to the raw score JSON generated by the Open LLM Leaderboard.

| Model | Average | IFEval | BBH | Math Lv5 | GPQA | MUSR | MMLU-PRO |
| ----- | ------- | ------ | --- | -------- | ---- | ---- | -------- |
| [Llama-3.1-8B-Instruct](https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/meta-llama/Meta-Llama-3.1-8B-Instruct/results_2024-10-24T00-00-00.000000.json) | 42.24 | 80.48 | 50.62 | 19.34 | 26.76 | 38.62 | 37.62 |
| [Llama-3.1-8B-GRPO-Instruct](https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/ymcki/Llama-3.1-8B-GRPO-Instruct/results_2025-02-24T17-37-02.760485.json) | 42.00 | 75.61 | 51.21 | 20.24 | 29.45 | 38.10 | 37.38 |
| Llama-3.1-8B-SFT-GRPO-Instruct | | | | | | | |

## How to run this model

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hub repo id; a locally downloaded copy of the model works here as well
model_id = "ymcki/Llama-3.1-8B-SFT-GRPO-Instruct"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    {"role": "user", "content": "Write a hello world program"},
]
# Build the Llama 3.1 chat prompt, then generate and print the assistant's reply
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# add_special_tokens=False because the template already starts with <|begin_of_text|>
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific files you want:

```
huggingface-cli download ymcki/Llama-3.1-8B-SFT-GRPO-Instruct --include "*" --local-dir ./
```
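
The same download can also be done from Python with `huggingface_hub`; a brief equivalent of the command above:

```py
from huggingface_hub import snapshot_download

# Download the full repository into the current directory
snapshot_download(repo_id="ymcki/Llama-3.1-8B-SFT-GRPO-Instruct", local_dir="./")
```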

## Credits

Thanks to DeepSeek for developing the original GRPO method.