---
library_name: transformers
license: apache-2.0
base_model: Shekswess/trlm-stage-2-sft-final-2
tags:
- trl
- dpo
- preference-alignment
- reasoning
- generated_from_trainer
model-index:
- name: trlm-stage-3-dpo-final-2
  results: []
---
# trlm-stage-3-dpo-final-2
trlm-stage-3-dpo-final-2 is the Stage 3 post-training model for the Tiny Reasoning Language Model (trlm) project.
This stage focuses on preference alignment using Direct Preference Optimization (DPO) with 50k preference pairs.
## Model Description
- Base Model: Shekswess/trlm-stage-2-sft-final-2
- Type: Causal Language Model (decoder-only transformer)
- Stage: Post-training Stage 3 (DPO)
- Objective: Align model outputs with human-preferred reasoning and answers by contrasting chosen vs rejected completions.
This stage improves the model's alignment, coherence, and reasoning stability.
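For readers unfamiliar with DPO, the snippet below is a minimal sketch of the DPO loss as it is usually formulated (policy vs. reference log-probability margins on chosen and rejected completions). The `beta` value and the toy inputs are illustrative assumptions, not values from this training run.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are summed log-probabilities of whole completions, shape (batch,).
    beta=0.1 is a common default, not necessarily the value used for this model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy example with random log-probabilities: the policy slightly prefers the chosen answers
base = torch.randn(4)
loss = dpo_loss(base + 0.5, base, base.clone(), base.clone())
print(loss)
```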
## Intended Uses & Limitations

### Intended Uses
- Aligned reasoning assistant with structured `<think>` traces
- Multi-turn reasoning with preference-optimized outputs
- Safer, more useful responses for reasoning tasks

### Limitations
- Trained only on preference data, so it may inherit biases from source datasets
- Limited parameter count (135M) restricts knowledge breadth
- Still prone to hallucinations under complex reasoning chains
## Training Data

This model was trained on the dataset `Shekswess/trlm-dpo-stage-3-final-2`.

Dataset summary:
- Entries: 50,000 preference pairs
- Source: scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered
- Focus: Preference alignment with chosen vs. rejected responses

| Source Dataset | Split | Entries | % |
|---|---|---|---|
| scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered | train | 50,000 | 100% |
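For inspection, the preference pairs can be loaded with the `datasets` library as sketched below. The `prompt`/`chosen`/`rejected` column names follow the conventional DPO layout and are assumed here rather than confirmed from the dataset card.

```python
from datasets import load_dataset

# Load the Stage 3 preference dataset (50k pairs)
ds = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")
print(ds)  # size and actual column names

example = ds[0]
# Typical DPO columns; adjust if the actual schema differs
for key in ("prompt", "chosen", "rejected"):
    if key in example:
        print(key, "->", str(example[key])[:200])
```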
## Training Procedure

### Training Hyperparameters
- Learning rate: 1e-5
- Train batch size: 32
- Eval batch size: 8
- Gradient accumulation steps: 4
- Total effective batch size: 128
- Optimizer: AdamW (betas=(0.9, 0.999), eps=1e-08)
- LR Scheduler: Cosine with minimum LR + warmup ratio 0.1
- Epochs: 1
- Seed: 42
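
As a rough guide, the hyperparameters above map onto TRL's `DPOConfig` roughly as sketched below. This is not the actual training script: the DPO `beta`, the min-LR scheduler setting, and the `processing_class` argument name (which differs across TRL versions) are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Shekswess/trlm-stage-2-sft-final-2"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)
train_ds = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")

config = DPOConfig(
    output_dir="trlm-stage-3-dpo",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,             # 32 * 4 = 128 effective batch size on a single device
    num_train_epochs=1,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # assumed min-LR setting, not reported above
    warmup_ratio=0.1,
    seed=42,
    beta=0.1,                                  # assumed DPO beta, not reported above
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    processing_class=tokenizer,  # older TRL releases take `tokenizer=` instead
)
trainer.train()
```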
### Framework Versions
- Transformers: 4.56.2
- PyTorch: 2.7.1+rocm7.0.0.git698b58a9
- Datasets: 4.0.0
- Tokenizers: 0.22.1
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Shekswess/trlm-stage-3-dpo-final-2"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example inference with preference-aligned reasoning
messages = [
    {"role": "user", "content": "Explain why the sky is blue in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
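Continuing from the snippet above, if the model wraps its reasoning in `<think>...</think>` tags (as suggested under Intended Uses), the trace can be separated from the final answer. The exact tag format is an assumption.

```python
import re

# Decode without skipping special tokens, in case <think> is registered as one
decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Separate the reasoning trace from the final answer (assumes <think>...</think> tags)
match = re.search(r"<think>(.*?)</think>", decoded, flags=re.DOTALL)
if match:
    print("Reasoning trace:\n", match.group(1).strip())
    print("\nFinal answer:\n", decoded[match.end():].strip())
else:
    print(decoded)
```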
Part of the Tiny Reasoning Language Model (trlm) post-training pipeline.