Read the paper on arXiv: 👉 https://arxiv.org/abs/2510.07790

GitHub: https://github.com/AchoWu/GCPO

GCPO (Group Contrastive Policy Optimization) is a novel reinforcement learning algorithm designed to improve the reasoning capabilities of language models, especially on problems where the model fails to generate any correct response. Unlike earlier methods such as GRPO, which rely solely on the model's own rollouts, GCPO introduces Golden Answers (GAs), which are external reference answers, to guide the model's updates when all sampled responses are incorrect.

This approach ensures:

✅ Full sample utilization: no training data is wasted
🧠 Knowledge transfer: small models learn reasoning strategies from larger models
🚀 Faster convergence and better generalization
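
To make the fallback described above concrete, here is a minimal sketch, not the implementation from the paper or the GitHub repository. It assumes binary correctness rewards and a GRPO-style group-normalized advantage. The function names (`group_advantages`, `select_training_signal`) and the fixed weight of `1.0` for the golden answer are hypothetical placeholders chosen for illustration; the actual GCPO objective is defined in the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards normalized within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_training_signal(rollouts, rewards, golden_answer):
    """Pick what to learn from for a single prompt.

    Assumption: a reward > 0 means the rollout was judged correct.
    """
    if any(r > 0 for r in rewards):
        # At least one rollout is correct: learn from the model's own samples.
        return {
            "source": "rollouts",
            "samples": rollouts,
            "advantages": group_advantages(rewards),
        }
    # Every rollout is wrong, so all group-normalized advantages are zero and
    # the prompt would contribute no gradient. Fall back to the external
    # golden answer instead (the weight 1.0 is an illustrative placeholder).
    return {
        "source": "golden_answer",
        "samples": [golden_answer],
        "advantages": [1.0],
    }
```

With rewards `[0.0, 0.0]` the sketch returns the golden answer as the only training sample, while rewards `[1.0, 0.0]` keep the rollouts and give the correct one a positive advantage and the incorrect one a negative advantage.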

🛠️ Model Use

✅ Use with Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Ach0/GCPO-R1-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True)

question = """
Solve the following math problem efficiently and clearly.  The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.

Point $B$ is on $\\overline{AC}$ with $AB = 9$ and $BC = 21.$ Point $D$ is not on $\\overline{AC}$ so that $AD = CD,$ and $AD$ and $BD$ are integers. Let $s$ be the sum of all possible perimeters of $\\triangle ACD$. Find $s.$
"""

messages = [
    {"role": "user", "content": question}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

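# The chat template already inserts the special tokens, so do not add them again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192)
# Decode the full sequence (the prompt followed by the generated response).
print(tokenizer.decode(outputs[0], skip_special_tokens=True))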

✅ Use with vLLM (fast inference)

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "Ach0/GCPO-R1-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=8192
)

question = """
Solve the following math problem efficiently and clearly.  The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.

Point $B$ is on $\\overline{AC}$ with $AB = 9$ and $BC = 21.$ Point $D$ is not on $\\overline{AC}$ so that $AD = CD,$ and $AD$ and $BD$ are integers. Let $s$ be the sum of all possible perimeters of $\\triangle ACD$. Find $s.$
"""

messages = [
    {"role": "user", "content": question}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
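
Both snippets print the raw response text. Since the prompt asks the model to end with a `\boxed{...}` expression, a small helper like the one below can pull out the final answer for scoring. This is a sketch under the assumption that the model follows the requested format; `extract_boxed_answer` is a hypothetical helper and not part of the released code.

```python
def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{1}{2} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the response contains no boxed answer

    depth = 1
    chars = []
    for c in text[start + len(marker):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(c)
    return None  # braces never closed, e.g. a truncated generation


# Example on a hand-written string (not real model output):
sample = "Therefore, the final answer is: $\\boxed{\\frac{1}{2}}$. I hope it is correct"
print(extract_boxed_answer(sample))  # prints \frac{1}{2}
```

Returning `None` for a missing or unclosed box makes it easy to count truncated generations separately from wrong answers.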

📊 GCPO Improves Reasoning Performance

GCPO consistently outperforms DAPO.

Safetensors
Model size: 2B params
Tensor type: BF16

Model tree for Ach0/GCPO-R1-1.5B

Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B