# Qwen3-1.7B-GSM8K-GRPO-verl
This model is a fine-tuned version of Qwen/Qwen3-1.7B, adapted for the GSM8K dataset using Group Relative Policy Optimization (GRPO) via the Verl framework.
## How to Use
You can use this model directly with the Hugging Face `transformers` library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Makrrr/Qwen3-1.7B-GSM8K-GRPO-verl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Recommended on H100/A100; try torch.float16 on other GPUs
    device_map="auto",           # Automatically place the model on available GPU(s)
)
model.eval()  # Set the model to evaluation mode

prompt = "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.\nSolution:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,        # Generate up to 256 new tokens for the response
        num_return_sequences=1,
        do_sample=True,            # Set to False for greedy (deterministic) decoding
        temperature=0.7,           # Lower values give more focused, less varied output
        top_p=0.9,                 # Nucleus sampling threshold
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
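
GSM8K reference solutions mark the final answer with a `####` prefix. If the fine-tuned model reproduces that convention (an assumption, not guaranteed for every generation), the numeric answer can be pulled out of `response` with a small illustrative helper:

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Return the number after the last '####' marker, GSM8K-style.

    Assumes the model follows the GSM8K answer format; returns None otherwise.
    """
    matches = re.findall(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return matches[-1].replace(",", "") if matches else None

print(extract_final_answer(response))  # e.g. "42" if the format was followed
```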
## Training Details
This model was trained using the Verl framework with the following configuration:
- Base Model: Qwen/Qwen3-1.7B
- Dataset: GSM8K (train.parquet and test.parquet prepared via Verl's gsm8k.py script)
- Optimization Algorithm: GRPO (Group Relative Policy Optimization; see the sketch after this list)
- Training Framework: Verl
- Inference Backend: vLLM (for rollouts and reference model log probabilities)
- Training Hardware: 1x NVIDIA H100 80GB HBM3 GPU
- Total Epochs: 15
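
For readers unfamiliar with GRPO: instead of relying on a learned value function, it scores each sampled response relative to the other responses drawn for the same prompt. The snippet below is a minimal sketch of that group-relative advantage computation (illustrative only; it is not the Verl implementation, and the 0/1 rewards are a stand-in for an exact-match answer check):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    rewards: shape (group_size,) — one scalar reward per sampled response to
    the same prompt (e.g. 1.0 if the final answer is correct, else 0.0).
    Each response's advantage is its reward standardized within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one GSM8K question, two of which were correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))  # tensor([ 0.8660, -0.8660,  0.8660, -0.8660])
```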
## Final Evaluation Metrics
After completing 15 epochs of training, the model achieved the following final validation metric on the GSM8K test set:
- mean@1: 0.8377558756633814 (approximately 83.78% average reward)
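
For context, mean@1 denotes the average reward with a single generated sample per test question: each response scores 1.0 if its extracted final answer matches the reference and 0.0 otherwise. A minimal sketch of that scoring, reusing the illustrative `extract_final_answer` helper from the usage section:

```python
def mean_at_1(predictions: list[str], references: list[str]) -> float:
    """Average exact-match reward with one generation per question."""
    hits = [
        1.0 if extract_final_answer(pred) == extract_final_answer(ref) else 0.0
        for pred, ref in zip(predictions, references)
    ]
    return sum(hits) / len(hits)
```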