Shekswess committed
Commit 7bbf7f8 · verified · 1 Parent(s): 369877b

Upload 12 files

README.md CHANGED
@@ -1,3 +1,115 @@
- ---
- license: apache-2.0
- ---
---
library_name: transformers
license: apache-2.0
base_model: Shekswess/trlm-stage-2-sft-final-2
tags:
- trl
- dpo
- preference-alignment
- reasoning
- generated_from_trainer
model-index:
- name: trlm-stage-3-dpo-final-2
  results: []
---

<p align="center">
  <img src="https://sdmntprnortheu.oaiusercontent.com/files/00000000-f580-61f4-9d8f-e2ad1ad30cb1/raw?se=2025-09-28T13%3A44%3A27Z&sp=r&sv=2024-08-04&sr=b&scid=d18de0ac-b41e-5d89-82aa-2a8c74df25d6&skoid=f28c0102-4d9d-4950-baf0-4a8e5f6cf9d4&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2025-09-27T15%3A59%3A48Z&ske=2025-09-28T15%3A59%3A48Z&sks=b&skv=2024-08-04&sig=CSrmTwUK5za43FjSFhOlkzGlLkqG2CDPpKYkYtSdV6g%3D" alt="TRLm Stage 3 Banner" width="800"/>
</p>

# 🧠 trlm-stage-3-dpo-final-2

`trlm-stage-3-dpo-final-2` is the **Stage 3** post-training model for the **Tiny Reasoning Language Model (trlm)** project.
This stage focuses on **preference alignment** using **Direct Preference Optimization (DPO)** with 50k preference pairs.

---

## 📖 Model Description

- **Base Model**: [Shekswess/trlm-stage-2-sft-final-2](https://huggingface.co/Shekswess/trlm-stage-2-sft-final-2)
- **Type**: Causal Language Model (decoder-only transformer)
- **Stage**: Post-training **Stage 3 (DPO)**
- **Objective**: Align model outputs with human-preferred reasoning and answers by contrasting **chosen** vs **rejected** completions.

This stage improves the model’s **alignment**, **coherence**, and **reasoning stability**.
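
For context, DPO fine-tunes the policy directly on these preference pairs, without a separate reward model, by contrasting each chosen completion $y_w$ against its rejected counterpart $y_l$ relative to a frozen reference policy (here presumably the Stage 2 SFT checkpoint). The standard objective from Rafailov et al. (2023) is:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$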

---

## 🎯 Intended Uses & Limitations

### Intended Uses
- Aligned reasoning assistant with structured `<think>` traces
- Multi-turn reasoning with preference-optimized outputs
- Safer, more useful responses for reasoning tasks

### Limitations
- Trained only on preference data → may inherit biases from source datasets
- Limited parameter count (135M) restricts knowledge breadth
- Still prone to hallucinations under complex reasoning chains

---

## 📊 Training Data

This model was trained on the dataset:
👉 [**Shekswess/trlm-dpo-stage-3-final-2**](https://huggingface.co/datasets/Shekswess/trlm-dpo-stage-3-final-2)

**Dataset summary**:
- **Entries**: 50,000 preference pairs
- **Source**: `scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered`
- **Focus**: Preference alignment with **chosen vs rejected responses**

| Source Dataset | Split | Entries | % |
|----------------|-------|---------|---|
| scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered | train | 50,000 | 100% |

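
A quick way to inspect the preference pairs is sketched below. It assumes the dataset loads with a plain `train` split (as listed above) and leaves the exact column names to the dataset card rather than guessing them:

```python
from datasets import load_dataset

# Load the 50k-pair preference dataset used for this DPO stage
ds = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")
print(ds)            # expected: ~50,000 rows
print(ds[0].keys())  # inspect the prompt / chosen / rejected style schema
```
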
---

## ⚙️ Training Procedure

### Training Hyperparameters
- **Learning rate**: 1e-5
- **Train batch size**: 32
- **Eval batch size**: 8
- **Gradient accumulation steps**: 4
- **Total effective batch size**: 128
- **Optimizer**: AdamW (betas=(0.9, 0.999), eps=1e-08)
- **LR Scheduler**: cosine with minimum LR, warmup ratio 0.1
- **Epochs**: 1
- **Seed**: 42

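
A minimal sketch of how these settings might map onto TRL's `DPOConfig`/`DPOTrainer`. This is illustrative only: the DPO `beta`, the minimum-LR fraction, and the exact trainer argument names (which shift between TRL versions) are assumptions, not values reported in this card.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Shekswess/trlm-stage-2-sft-final-2"  # Stage 2 SFT checkpoint (base_model above)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

args = DPOConfig(
    output_dir="trlm-stage-3-dpo-final-2",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # 32 x 4 = 128 effective batch size
    num_train_epochs=1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # assumed; the card only says "minimum LR"
    seed=42,
)

train_ds = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")
trainer = DPOTrainer(model=model, args=args, train_dataset=train_ds, processing_class=tokenizer)
trainer.train()
```
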
### Framework Versions
- **Transformers**: 4.56.2
- **PyTorch**: 2.7.1+rocm7.0.0.git698b58a9
- **Datasets**: 4.0.0
- **Tokenizers**: 0.22.1

---

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Shekswess/trlm-stage-3-dpo-final-2"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example inference with preference-aligned reasoning
messages = [
    {"role": "user", "content": "Explain why the sky is blue in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
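
Because the chat template wraps the model's reasoning in `<think>...</think>`, you may want the trace and the final answer separately. A small follow-up sketch (plain string handling, assuming the model emits the tags as intended; note that `skip_special_tokens=True` above strips the tags themselves):

```python
# Continues from the example above: decode only the newly generated tokens,
# keeping special tokens so the <think> markers survive decoding.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
decoded = tokenizer.decode(new_tokens, skip_special_tokens=False)

if "</think>" in decoded:
    trace, answer = decoded.split("</think>", 1)
    trace = trace.replace("<think>", "").strip()
    answer = answer.replace("<|im_end|>", "").strip()
else:
    trace, answer = "", decoded.strip()

print("Reasoning trace:\n", trace)
print("\nFinal answer:\n", answer)
```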

---

Part of the Tiny Reasoning Language Model (trlm) post-training pipeline.
added_tokens.json ADDED
@@ -0,0 +1,4 @@
{
  "</think>": 49153,
  "<think>": 49152
}
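
These two ids extend the base 49,152-token vocabulary (hence `vocab_size: 49154` in `config.json`). A quick check once the tokenizer is loaded:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Shekswess/trlm-stage-3-dpo-final-2")
# Expected from added_tokens.json: <think> -> 49152, </think> -> 49153
print(tok.convert_tokens_to_ids(["<think>", "</think>"]))
```
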
chat_template.jinja ADDED
@@ -0,0 +1,9 @@
{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}
{{ '<|im_start|>system\nYou are a helpful AI assistant named Tiny Reasoning Language Model, trained by Shekswess. You are an assistant, with the ability to do reasoning. When performing reasoning always perform your full chain of thought inside <think>...</think> before giving a final answer. You are always reasoning so always use <think> </think> tags.<|im_end|>\n' }}
{% endif %}
{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}
{% endfor %}
{% if add_generation_prompt %}
{{ '<|im_start|>assistant\n' }}
{% endif %}
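
This is the ChatML-style template used by `apply_chat_template` in the Usage example above: if no system message is supplied, the Tiny Reasoning Language Model system prompt is injected automatically. A rough sketch of the rendered prompt for a single user turn (exact whitespace is governed by the template above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Shekswess/trlm-stage-3-dpo-final-2")
messages = [{"role": "user", "content": "What is 2 + 2?"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Roughly:
# <|im_start|>system
# You are a helpful AI assistant named Tiny Reasoning Language Model, ...<|im_end|>
# <|im_start|>user
# What is 2 + 2?<|im_end|>
# <|im_start|>assistant
```
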
config.json ADDED
@@ -0,0 +1,38 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.041666666666666664,
  "intermediate_size": 1536,
  "is_llama_config": true,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value_heads": 3,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
  "rope_scaling": null,
  "rope_theta": 100000,
  "tie_word_embeddings": true,
  "transformers.js_config": {
    "kv_cache_dtype": {
      "fp16": "float16",
      "q4f16": "float16"
    }
  },
  "transformers_version": "4.56.2",
  "use_cache": true,
  "vocab_size": 49154
}
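
This describes a ~135M-parameter Llama-style architecture (consistent with the 135M figure in the README): 30 layers, 576-dim hidden states, grouped-query attention with 9 query heads sharing 3 KV heads (head_dim 64), tied embeddings, and an 8,192-token context. A quick way to confirm, sketched below:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Shekswess/trlm-stage-3-dpo-final-2")
print(cfg.model_type, cfg.num_hidden_layers, cfg.hidden_size)      # llama 30 576
print(cfg.num_attention_heads, cfg.num_key_value_heads)            # 9 3 (grouped-query attention)
print(cfg.num_attention_heads * cfg.head_dim == cfg.hidden_size)   # True (9 * 64 = 576)
```
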
generation_config.json ADDED
@@ -0,0 +1,9 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": [
    2
  ],
  "pad_token_id": 2,
  "transformers_version": "4.56.2"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f72be34d84f9d95d2f1ba9e6aa5d352ec841ce1e7aed81998135ce5bf96e9a09
size 269062856
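
(The ~269 MB weight file is consistent with a ~135M-parameter model stored in bfloat16: 269,062,856 bytes / 2 bytes per parameter ≈ 134.5M parameters.)
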
special_tokens_map.json ADDED
@@ -0,0 +1,34 @@
{
  "additional_special_tokens": [
    "<think>",
    "</think>"
  ],
  "bos_token": {
    "content": "<|im_start|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,170 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<repo_name>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<file_sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<jupyter_script>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49152": {
      "content": "<think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49153": {
      "content": "</think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<think>",
    "</think>"
  ],
  "bos_token": "<|im_start|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "extra_special_tokens": {},
  "model_max_length": 8192,
  "pad_token": "<|im_end|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
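
As defined above, the tokenizer reuses the ChatML markers as its special tokens (`<|im_start|>` as BOS, `<|im_end|>` as both EOS and PAD) and caps inputs at 8,192 tokens, matching `max_position_embeddings` in `config.json`. A quick check:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Shekswess/trlm-stage-3-dpo-final-2")
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.unk_token)  # <|im_start|> <|im_end|> <|im_end|> <|endoftext|>
print(tok.model_max_length)                                        # 8192
```
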
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:247b792c868f8245ceddd15c2b2486a99317401202045e562ed34e945a36ed82
size 6865
vocab.json ADDED
The diff for this file is too large to render. See raw diff