---
license: mit
datasets:
- PrimeIntellect/Reverse-Text-RL
language:
- en
base_model:
- Qwen/Qwen3-0.6B
- PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT
---
# Reverse Text Model Qwen3-0.6B

A small model that was RL fine-tuned for 20 steps/epochs after SFT to reverse text, using [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl/) for RL training and [reverse-text](https://github.com/PrimeIntellect-ai/prime-environments/tree/main/environments/reverse_text) as the RL environment. The results below show the improvement:

## Comparison with SFT (base) model

The reward (correctness score) distribution improves for the RL fine-tuned model across all rollouts.
![](comparison.png)

At the instance level, comparing the best scores across rollouts, we see a mean improvement of 3.73%, with a maximum gain of ~30% and a maximum regression of ~3%.
![](instance-level.png)

## Example Prompt & Reward

**Task:** `reverse-text`

**Prompt:**
- **System:**  
  “Reverse the text character-by-character. Put your answer in `<reversed_text>` tags.”
- **User:**  
  “The community in Bruck was merged into it”

**Expected Completion:**  
```text
<reversed_text>
.ti otni degrem saw kcuBr ni ytinummoc ehT
</reversed_text>
```

**Expected Reward:**
0.963855421686747

Note: The reward is based on the longest common subsequence (LCS) between the completion and the expected reversed text.
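
The environment's exact scoring code isn't reproduced here, but an LCS-based reward can be sketched as follows. This is a minimal sketch under assumptions: the function names and the normalization by target length are illustrative, not the environment's actual implementation.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, via O(len(a) * len(b)) DP."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]


def lcs_reward(completion: str, target: str) -> float:
    """Score a completion against the expected reversed text, normalized to [0, 1]."""
    if not target:
        return 0.0
    return lcs_length(completion, target) / len(target)


# An exact reversal earns the full reward; small transpositions (like the
# example completion above) shorten the LCS and reduce the score slightly.
target = "hello"[::-1]              # "olleh"
print(lcs_reward("olleh", target))  # exact match -> 1.0
```

Because LCS is order-preserving but tolerant of insertions and deletions, a completion with a few swapped or extra characters still scores close to 1.0 rather than 0, which gives a smoother training signal than exact-match reward.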