GCPO: When Contrast Fails, Go Gold
Read the paper on arXiv: 👉 https://arxiv.org/abs/2510.07790
GitHub: https://github.com/AchoWu/GCPO
GCPO (Group Contrastive Policy Optimization) is a reinforcement learning algorithm designed to improve the reasoning capabilities of language models, especially on problems where the model fails to generate any correct response. Earlier methods such as GRPO rely solely on the model's own rollouts. GCPO instead introduces Golden Answers (GAs), which are external reference answers, to guide the model's updates when all sampled responses are incorrect.
This approach ensures:
✅ Full sample utilization — no training data is wasted
🧠 Knowledge transfer — small models learn reasoning strategies from larger models
🚀 Faster convergence and better generalization
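To make the idea concrete, here is a minimal sketch of why an all-wrong group yields no learning signal under GRPO-style group normalization, and how adding a golden answer restores one. It is an illustration under stated assumptions, not the authors' implementation: the helper `inject_golden_answer`, the choice to replace the first rollout, and the 0/1 reward scheme are all hypothetical. See the paper for the actual objective.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def inject_golden_answer(rollouts, rewards, golden_answer):
    """Hypothetical helper (assumption, not from the paper's code):
    if no rollout is correct, swap the golden answer in for the first
    rollout and assign it the 'correct' reward of 1.0."""
    if any(r > 0 for r in rewards):
        return list(rollouts), list(rewards)
    return [golden_answer] + list(rollouts[1:]), [1.0] + list(rewards[1:])

rollouts = ["wrong A", "wrong B", "wrong C", "wrong D"]
rewards = [0.0, 0.0, 0.0, 0.0]

# With only the model's own rollouts, every advantage is zero.
plain = group_advantages(rewards)

# With the golden answer in the group, it gets a positive advantage
# and the incorrect rollouts get negative ones.
group, new_rewards = inject_golden_answer(rollouts, rewards, "reference solution")
guided = group_advantages(new_rewards)
```

In this sketch the group is left untouched whenever at least one rollout is already correct, so the golden answer only steps in where the contrast between rollouts would otherwise vanish.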
🛠️ Model Use
✅ Use with Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Ach0/GCPO-R1-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True)
question = """
Solve the following math problem efficiently and clearly. The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.
Point $B$ is on $\\overline{AC}$ with $AB = 9$ and $BC = 21.$ Point $D$ is not on $\\overline{AC}$ so that $AD = CD,$ and $AD$ and $BD$ are integers. Let $s$ be the sum of all possible perimeters of $\\triangle ACD$. Find $s.$
"""
messages = [
    {"role": "user", "content": question}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
✅ Use with vLLM (fast inference)
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_name = "Ach0/GCPO-R1-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=8192
)
question = """
Solve the following math problem efficiently and clearly. The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.
Point $B$ is on $\\overline{AC}$ with $AB = 9$ and $BC = 21.$ Point $D$ is not on $\\overline{AC}$ so that $AD = CD,$ and $AD$ and $BD$ are integers. Let $s$ be the sum of all possible perimeters of $\\triangle ACD$. Find $s.$
"""
messages = [
    {"role": "user", "content": question}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
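Both prompts above ask the model to finish with `\boxed{...}`, so a small parser is usually enough to pull the final answer out of the generated text. The helper below is a sketch rather than part of the GCPO repository: the name `extract_boxed` is hypothetical, and it assumes the last `\boxed{...}` in the output is the intended answer.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX such as
    \\boxed{\\frac{1}{2}} is returned whole.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces, e.g. a truncated generation

sample = r"Therefore, the final answer is: $\boxed{\frac{1}{2}}$. I hope it is correct"
answer = extract_boxed(sample)
```

With vLLM, the string to parse would be `outputs[0].outputs[0].text`. With Transformers, the decoded output also contains the prompt, which is why the sketch searches from the end of the text.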
📊 GCPO Improves Reasoning Performance
GCPO consistently outperforms DAPO; see the paper linked above for the benchmark results.
Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B