---
library_name: transformers
license: apache-2.0
base_model: Shekswess/trlm-stage-2-sft-final-2
tags:
- trl
- dpo
- preference-alignment
- reasoning
- generated_from_trainer
model-index:
- name: trlm-stage-3-dpo-final-2
  results: []
---

# Tiny Reasoning Language Model (trlm-135)

![image/png](https://github.com/user-attachments/assets/5f453496-8180-4cf4-94da-26ebbe1159d4)

## Table of Contents

1. [Model Summary](#model-summary)
2. [Post-Training Pipeline](#post-training-pipeline)
3. [How to use](#how-to-use)
4. [Training](#training)
5. [Evaluation](#evaluation)
6. [Limitations](#limitations)
7. [Acknowledgements](#acknowledgements)
8. [License](#license)
---

## Model Summary

The **Tiny Reasoning Language Model (trlm-135)** is a **135M parameter** research prototype designed to study how small models can learn step-by-step reasoning.
It was built on top of [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) and fine-tuned through a **3-stage pipeline**:

* **[Stage 1 SFT](https://huggingface.co/Shekswess/trlm-stage-1-sft-final-2)**: general instruction tuning (non-reasoning).
* **[Stage 2 SFT](https://huggingface.co/Shekswess/trlm-stage-2-sft-final-2)**: reasoning traces with `<think>` tags.
* **[Stage 3 DPO](https://huggingface.co/Shekswess/trlm-stage-3-dpo-final-2)**: preference alignment for reasoning style.

The **code** for everything can be found **[here](https://github.com/Shekswess/tiny-reasoning-language-model/blob/main/README.md)**.

---

## Post-Training Pipeline
<img width="1014" height="563" alt="image" src="https://github.com/user-attachments/assets/195ef389-6aa9-4527-b4f0-bea68c0841ae" />

---

## How to use

```bash
pip install -U transformers accelerate
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shekswess/trlm-135m"
device = "cuda"  # or "cpu"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)

# Example prompt
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "user", "content": prompt}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> [!TIP]
> For reasoning-heavy tasks, set `temperature=0.6` and `top_p=0.95`.

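Because Stage 2 trains the model to emit its reasoning inside `<think>` tags, you may want to separate the reasoning trace from the final answer before showing it to users. A minimal sketch (the inline `<think>...</think>` format and the helper name `split_reasoning` are assumptions for illustration):

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) around a <think> block."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No reasoning trace emitted; treat the whole completion as the answer.
        return "", completion.strip()
    return match.group(1).strip(), completion[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>Gravity pulls masses together.</think> Gravity is an attractive force."
)
print(answer)  # Gravity is an attractive force.
```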
---

## Training

### Model

* **Architecture**: Decoder-only transformer (SmolLM2 backbone, which is in fact a Llama-based model).
* **Parameters**: ~135M.
* **Precision**: mixed-precision (bfloat16) during training.

### Software & Hardware

* **Training Frameworks**: PyTorch (ROCm), Hugging Face Transformers & TRL.
* **Hardware**: AMD MI300X (192GB VRAM, 224GB RAM).

**Special thanks to [@HotAisle](https://x.com/HotAisle)**

### Training Stages

1. **Stage 1 – SFT (non-reasoning)**
   * ~58k samples, everyday conversations & instruction following.
2. **Stage 2 – SFT (reasoning)**
   * ~78k samples with `<think>` segments.
3. **Stage 3 – DPO (alignment)**
   * ~50k preference pairs (chosen vs. rejected reasoning traces).
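
Each Stage 3 sample is a preference pair: one prompt with a chosen and a rejected completion. Schematically (the field names follow TRL's DPO dataset convention; the contents here are invented for illustration):

```python
# One preference pair in the prompt/chosen/rejected format consumed by TRL's DPOTrainer.
pair = {
    "prompt": "What is 2 + 2?",
    "chosen": "<think>2 + 2 = 4.</think> The answer is 4.",  # preferred reasoning trace
    "rejected": "The answer is 5.",                          # dispreferred completion
}
print(sorted(pair))  # ['chosen', 'prompt', 'rejected']
```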
---

## Evaluation

Evaluation was done with `lm-eval-harness`:

| **Benchmark**        | **Tiny Reasoning Language Model (trlm-135M)**  | **SmolLM2-135M-Instruct** | **Improvements** |
| -------------------- | ---------------------------- | ------------------------- | ---------------------------- |
| **ARC Challenge**    | **40.61** (avg)              | 37.3 (avg)                | **+3.31**                    |
| **BBH**              | **36.80** (3-shot)           | 28.2 (3-shot)             | **+8.6**                     |
| **BoolQ**            | **62.17**                    | –                         | N/A                          |
| **GSM8K**            | **2.59** (5-shot)            | 1.4 (5-shot)              | **+1.19**                    |
| **IFEval**           | **35.49** (avg)              | 29.9 (avg)                | **+5.59**                    |
| **MMLU**             | **34.95**                    | 29.3                      | **+5.65**                    |
| **PIQA**             | **64.91**                    | 66.3                      | **–1.39**                    |
| **HellaSwag**        | –                            | 40.9                      | N/A                          |
| **MT-Bench**         | –                            | 19.8                      | N/A                          |

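For reference, a comparable run can be launched with the harness CLI. The task list and batch size below are illustrative assumptions; the exact settings (few-shot counts, chat template handling) used for the table above are not reproduced here:

```shell
pip install -U lm-eval
lm_eval --model hf \
  --model_args pretrained=Shekswess/trlm-135m \
  --tasks arc_challenge,mmlu,gsm8k \
  --batch_size 8
```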
---

## Limitations

* **Not production-ready**: hallucinations and logical errors are frequent.
* **Small size**: limited general knowledge and reasoning depth.
* **English-only**: multilingual capabilities not explored.

---

## Acknowledgements

- [@HotAisle](https://x.com/HotAisle) for providing the compute resources to train all three stages on an awesome AMD MI300X setup.
- [@mkurman88](https://x.com/mkurman88) for ideas, feedback and code samples.
- [HuggingFaceTB team](https://huggingface.co/HuggingFaceTB) for the SmolLM2-135M-Instruct model and the Smoltalk2 dataset collection.
- [@scottgeng00](https://huggingface.co/scottgeng00) for the OLmO-3-Preference-Mix-Deltas dataset.
- [@eliebakouchi](https://x.com/eliebakouch) for help with the tokenization.

---

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

---