Reverse Text Model Qwen3-0.6B

A simple model that was RL fine-tuned for 20 steps/epochs after SFT to reverse text, using prime-rl (RL training) and reverse-text (RL environment). See the improvement in results below:

Comparison with SFT (base) model

The reward (correctness score) distribution has improved for the RLFT model across all rollouts.

At an instance level, comparing the best scores across rollouts, we see a mean improvement of 3.73%, with a maximum improvement of ~30% and a worst-case reduction of ~3%.

Example Prompt & Reward

Task: reverse-text

Prompt:

  • System:
    “Reverse the text character-by-character. Put your answer in <reversed_text> tags.”
  • User:
    “The community in Bruck was merged into it”

Expected Completion:

<reversed_text>
.ti otni degrem saw kcuBr ni ytinummoc ehT
</reversed_text>

Expected Reward: 0.963855421686747
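
For reference, here is a minimal inference sketch using transformers with the prompt format above. The repo id is the one from this card; the chat-template call and generation settings are assumptions, not necessarily the exact setup used during training or evaluation.

```python
# Minimal inference sketch. Assumptions: standard transformers chat API and
# greedy decoding; not necessarily the settings used for training/evaluation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sameersegal/Qwen3-0.6B-Reverse-Text-SFT-RLFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system",
     "content": "Reverse the text character-by-character. Put your answer in <reversed_text> tags."},
    {"role": "user", "content": "The community in Bruck was merged into it"},
]

# Build the prompt with the model's chat template and generate a completion.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```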

Note: The reward is based on the longest common subsequence.
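
The exact scoring formula isn't given here; a plausible sketch is to take the character-level longest common subsequence between the model's output and the reference, normalized by the reference length (the normalization is an assumption).

```python
# Sketch of an LCS-based reward. Assumption: LCS is computed over characters
# and normalized by the reference length; the reverse-text environment's
# exact normalization may differ.
def lcs_length(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming LCS using one row.
    dp = [0] * (len(b) + 1)
    for ch_a in a:
        prev = 0  # dp value for (previous row, j - 1)
        for j, ch_b in enumerate(b, start=1):
            cur = dp[j]  # dp value for (previous row, j)
            dp[j] = prev + 1 if ch_a == ch_b else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def reward(completion: str, reference: str) -> float:
    # Fraction of the reference recovered as a common subsequence.
    return lcs_length(completion, reference) / max(len(reference), 1)
```

Under this definition, a near-perfect reversal with a couple of misplaced characters still earns a reward close to 1, consistent with the 0.9638… example above.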

