balnazzar
/

qwen-3B-r1-aha-v1

Model card Files Files and versions Community

Qwen-3B-R1-AHA-V1

This model was trained using GRPO (Group Relative Policy Optimization) on the Countdown Game task to develop reasoning capabilities.

Model Details

Base Model: Qwen/Qwen2.5-3B-Instruct
Training: GRPO with self-verification rewards
Task: Countdown Game mathematical reasoning

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("balnazzar/qwen-r1-aha")
tokenizer = AutoTokenizer.from_pretrained("balnazzar/qwen-r1-aha")

Training

Dataset: Countdown-Tasks-3to4
Reward Functions: Format checking and equation verification
Hardware: Nvidia A6000 (takes 45Gb)

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support