Qwen-3B-R1-AHA-V1

This model was trained using GRPO (Group Relative Policy Optimization) on the Countdown Game task to develop reasoning capabilities.

Model Details

  • Base Model: Qwen/Qwen2.5-3B-Instruct
  • Training: GRPO with self-verification rewards
  • Task: Countdown Game mathematical reasoning

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("balnazzar/qwen-r1-aha")
tokenizer = AutoTokenizer.from_pretrained("balnazzar/qwen-r1-aha")

Training

  • Dataset: Countdown-Tasks-3to4
  • Reward Functions: Format checking and equation verification
  • Hardware: Nvidia A6000 (takes 45Gb)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.