arxiv:2504.09643

Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Published on Apr 13 · Submitted by mponty on Apr 15
Abstract

Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.
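The abstract describes the core loop at a high level: sample many candidate solutions, rerank them, check them against tests, and feed confidently mis-ranked candidates ("high-scoring negatives") back into the reranker's training data. The sketch below is one possible reading of that loop, not the authors' implementation; every object and method here (generator.sample, reranker.score, task.run_tests, reranker.update, negative_threshold) is an illustrative assumption.

```python
def iterative_reranker_self_training(tasks, generator, reranker,
                                     num_iterations=3, samples_per_task=20):
    """Hypothetical sketch of the iterative self-training loop from the abstract.

    For each task: sample candidate solutions, score them with the reranker,
    verify them against the task's unit tests, and collect confidently
    mis-ranked failures as new negative training examples. The reranker is
    then refined on the refreshed dataset (the paper uses a PPO-based update;
    `reranker.update` stands in for that step here).
    """
    training_examples = []
    for _ in range(num_iterations):
        for task in tasks:
            candidates = generator.sample(task.prompt, n=samples_per_task)
            for cand in candidates:
                score = reranker.score(task.prompt, cand)
                passed = task.run_tests(cand)  # ground-truth signal from execution
                if passed:
                    training_examples.append((task.prompt, cand, +1.0))
                elif score > reranker.negative_threshold:
                    # High-scoring negative: the reranker rated it highly,
                    # but it fails the tests -- exactly the kind of example
                    # the abstract says gets added back into the loop.
                    training_examples.append((task.prompt, cand, -1.0))
        reranker.update(training_examples)  # iterative refinement step
    return reranker
```

The design choice worth noting is that execution feedback (passing or failing tests) supplies the labels, so the reranker is repeatedly corrected on precisely the failures it is most confident about.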


