# Model Card for RLPR-Qwen2.5-7B-Base
RLAIF-V/RLPR-Qwen2.5-7B-Base is trained from Qwen2.5-7B-Base with the RLPR framework, which eliminates reliance on external verifiers while remaining simple and scalable across general domains.
## Model Details
### Key Features
- **Verifier-Free Reasoning Enhancement**: RLPR pioneers reinforcement learning for reasoning tasks by leveraging the LLM's intrinsic generation probability as a direct reward signal. This eliminates the need for external verifiers and specialized fine-tuning, offering broad applicability and effectively handling complex, diverse answers.
- **Innovative Reward & Training Framework**:
  - Features a robust Probability-based Reward (PR) that uses the average decoding probability of the reference answer to produce higher-quality, debiased reward signals, outperforming naive sequence likelihood.
  - Implements an adaptive curriculum learning mechanism that dynamically filters prompts to stabilize training and significantly boost final performance (a minimal sketch of such a filter follows this list).
- **Leading Performance in General & Mathematical Reasoning**: Demonstrates substantial reasoning improvements across diverse benchmarks (e.g., 56.0 on MMLU-Pro and 55.4 on TheoremQA with Qwen2.5-7B). RLPR surpasses strong models reliant on external verifiers (like General Reasoner-7B) and other verifier-free approaches (like VeriFree).
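Below is a minimal sketch of how such an adaptive prompt filter might look, assuming each prompt is scored by sampling several rollouts and computing their rewards. The variance statistic, the `min_std` threshold, and the function names are illustrative assumptions, not the exact criterion used in RLPR training.

```python
# Minimal sketch of adaptive prompt filtering for curriculum learning.
# Assumption: prompts whose sampled rollouts all receive nearly identical rewards
# (too easy or too hard) provide little learning signal and are dropped.
# `rollout_rewards` and `min_std` are hypothetical names/values for illustration.
from statistics import pstdev
from typing import Callable, Sequence


def filter_prompts(
    prompts: Sequence[str],
    rollout_rewards: Callable[[str], Sequence[float]],
    min_std: float = 0.05,  # hypothetical threshold; could be adapted during training
) -> list[str]:
    kept = []
    for prompt in prompts:
        rewards = rollout_rewards(prompt)  # e.g. probability-based rewards of sampled responses
        if pstdev(rewards) >= min_std:  # enough reward variance -> informative prompt
            kept.append(prompt)
    return kept
```

The intuition is that prompts whose rollouts all earn nearly identical rewards contribute little gradient signal, so dropping them keeps training focused on informative examples.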
### Highlights
Existing RLVR methods rely on specialized verifiers for each domain, suffering from high complexity and limited scalability. Our RLPR framework replaces the complex verifier-based reward with a simple probability-based reward $r$ generated by the policy model $\pi_\theta$ itself: given the input question $x$, the generated reasoning content $z$ before the final answer, the generated final answer $y$, and the reference answer $y^*$, the reward is the average decoding probability of the reference-answer tokens, $r = \frac{1}{|y^*|}\sum_{t=1}^{|y^*|}\pi_\theta\left(y^*_t \mid x, z, y^*_{<t}\right)$. In the accompanying example (figure not included here), rules and verifier models wrongly label both correct responses as incorrect due to their limited capability of handling the complexity of natural language.
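As a concrete illustration, the snippet below sketches how such a probability-based reward can be computed with this model: the reference answer is teacher-forced after the question and the sampled reasoning, and the reward is the mean probability assigned to its tokens. The prompt concatenation and the `probability_reward` function are illustrative assumptions rather than the exact RLPR training code.

```python
# Minimal sketch of a probability-based reward (PR): the mean per-token probability
# the policy assigns to the reference answer y*, conditioned on the question x and
# the sampled reasoning z. Illustrative only; not the exact RLPR implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLAIF-V/RLPR-Qwen2.5-7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")


@torch.no_grad()
def probability_reward(question: str, reasoning: str, reference_answer: str) -> float:
    # Teacher-force the reference answer after the question and reasoning.
    context_ids = tokenizer(question + "\n" + reasoning, return_tensors="pt").input_ids.to(model.device)
    answer_ids = tokenizer(reference_answer, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([context_ids, answer_ids], dim=-1)
    logits = model(input_ids=input_ids).logits  # (1, seq_len, vocab_size)
    # Logits at position t predict token t+1, so the answer tokens are predicted
    # by the slice starting one position before the answer begins.
    answer_logits = logits[:, context_ids.shape[-1] - 1 : -1, :]
    probs = torch.softmax(answer_logits.float(), dim=-1)
    token_probs = probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)  # (1, answer_len)
    return token_probs.mean().item()  # average decoding probability of the reference answer
```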
### Model Description
- Trained from model: Qwen2.5-7B
- Trained on data: RLPR-Train
## Usage
The usage example below is adapted from Qwen2.5-7B-Instruct.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the RLPR model and its tokenizer
model_name = "RLAIF-V/RLPR-Qwen2.5-7B-Base"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat-formatted prompt
prompt = "How much energy is produced when the sun converts one kg of hydrogen into helium?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response and strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Citation
If you find our model/code/paper helpful, please consider citing our paper:
```bibtex
@article{placeholder,
  title={SCALING RLVR TO GENERAL DOMAIN WITHOUT VERIFIERS},
  author={placeholder},
  journal={placeholder},
  year={2025},
}
```