# Qwen3-1.7B-GSM8K-GRPO-verl
This model is a fine-tuned version of Qwen/Qwen3-1.7B, adapted for the GSM8K dataset using Group Relative Policy Optimization (GRPO) via the Verl framework.
## How to Use
You can use this model directly with the Hugging Face `transformers` library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Makrrr/Qwen3-1.7B-GSM8K-GRPO-verl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Recommended on H100/A100; try torch.float16 on other GPUs
    device_map="auto",           # Automatically place the model on available GPU(s)
)
model.eval()  # Set the model to evaluation mode

prompt = "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.\nSolution:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,        # Generate up to 256 new tokens for the response
        num_return_sequences=1,
        do_sample=True,            # Set to False for greedy (deterministic) decoding
        temperature=0.7,           # Lower values give more focused, less varied output
        top_p=0.9,                 # Nucleus sampling threshold
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
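
GSM8K reference solutions mark the final answer with a `####` prefix. If the fine-tuned model reproduces that convention (an assumption, not guaranteed for every generation), the numeric answer can be pulled out of `response` with a small illustrative helper:

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Return the number after the last '####' marker, GSM8K-style.

    Assumes the model follows the GSM8K answer format; returns None otherwise.
    """
    matches = re.findall(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return matches[-1].replace(",", "") if matches else None

print(extract_final_answer(response))  # e.g. "42" if the format was followed
```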
## Training Details
This model was trained using the Verl framework with the following configuration:
- Base Model: Qwen/Qwen3-1.7B
- Dataset: GSM8K (train.parquet and test.parquet prepared via Verl's gsm8k.py script)
- Optimization Algorithm: GRPO (Group Relative Policy Optimization; see the sketch after this list)
- Training Framework: Verl
- Inference Backend: vLLM (for rollouts and reference model log probabilities)
- Training Hardware: 1x NVIDIA H100 80GB HBM3 GPU
- Total Epochs: 15
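
For readers unfamiliar with GRPO: instead of relying on a learned value function, it scores each sampled response relative to the other responses drawn for the same prompt. The snippet below is a minimal sketch of that group-relative advantage computation (illustrative only; it is not the Verl implementation, and the 0/1 rewards are a stand-in for an exact-match answer check):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    rewards: shape (group_size,) — one scalar reward per sampled response to
    the same prompt (e.g. 1.0 if the final answer is correct, else 0.0).
    Each response's advantage is its reward standardized within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one GSM8K question, two of which were correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))  # tensor([ 0.8660, -0.8660,  0.8660, -0.8660])
```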
## Final Evaluation Metrics
After completing 15 epochs of training, the model achieved the following final validation metric on the GSM8K test set:
- mean@1: 0.8377558756633814 (approximately 83.78% average reward)
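
For context, mean@1 denotes the average reward with a single generated sample per test question: each response scores 1.0 if its extracted final answer matches the reference and 0.0 otherwise. A minimal sketch of that scoring, reusing the illustrative `extract_final_answer` helper from the usage section:

```python
def mean_at_1(predictions: list[str], references: list[str]) -> float:
    """Average exact-match reward with one generation per question."""
    hits = [
        1.0 if extract_final_answer(pred) == extract_final_answer(ref) else 0.0
        for pred, ref in zip(predictions, references)
    ]
    return sum(hits) / len(hits)
```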