---
library_name: transformers
license: apache-2.0
base_model: Shekswess/trlm-stage-2-sft-final-2
tags:
  - trl
  - dpo
  - preference-alignment
  - reasoning
  - generated_from_trainer
model-index:
  - name: trlm-stage-3-dpo-final-2
    results: []
---

*TRLm Stage 3 Banner*

# 🧠 trlm-stage-3-dpo-final-2

**trlm-stage-3-dpo-final-2** is the Stage 3 post-training model for the Tiny Reasoning Language Model (trlm) project.
This stage focuses on preference alignment using Direct Preference Optimization (DPO) with 50k preference pairs.


πŸ“– Model Description

- **Base Model:** Shekswess/trlm-stage-2-sft-final-2
- **Type:** Causal Language Model (decoder-only transformer)
- **Stage:** Post-training Stage 3 (DPO)
- **Objective:** Align model outputs with human-preferred reasoning and answers by contrasting chosen vs. rejected completions.

This stage improves the model’s alignment, coherence, and reasoning stability.
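
For reference, DPO optimizes the standard objective below (Rafailov et al., 2023), pushing the policy to prefer chosen over rejected completions relative to a reference policy; here the reference would be the Stage 2 SFT checkpoint.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where `y_w` is the chosen and `y_l` the rejected completion for prompt `x`.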


## 🎯 Intended Uses & Limitations

### Intended Uses

- Aligned reasoning assistant with structured `<think>` traces
- Multi-turn reasoning with preference-optimized outputs
- Safer, more useful responses for reasoning tasks

### Limitations

- Trained only on preference data, so it may inherit biases from the source datasets
- Limited parameter count (135M) restricts knowledge breadth
- Still prone to hallucinations in complex reasoning chains

πŸ“Š Training Data

This model was trained on the dataset:
πŸ‘‰ Shekswess/trlm-dpo-stage-3-final-2

Dataset summary:

- **Entries:** 50,000 preference pairs
- **Source:** `scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered`
- **Focus:** Preference alignment with chosen vs. rejected responses

| Source Dataset | Split | Entries | % |
|---|---|---|---|
| `scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered` | train | 50,000 | 100% |
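
Each entry pairs a prompt with a chosen and a rejected completion. The record below is only an illustrative sketch; the field names (`prompt`, `chosen`, `rejected`) follow the common TRL DPO convention and are not taken from the dataset card.

```python
# Illustrative shape of a single preference pair
# (field names assumed per the usual TRL DPO convention, not verified against the dataset).
preference_pair = {
    "prompt": "Explain why the sky is blue in simple terms.",
    "chosen": "<think>Sunlight scatters off air molecules; blue light scatters most ...</think> The sky looks blue because ...",
    "rejected": "The sky is blue because it reflects the ocean.",
}
```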

βš™οΈ Training Procedure

### Training Hyperparameters

- **Learning rate:** 1e-5
- **Train batch size:** 32
- **Eval batch size:** 8
- **Gradient accumulation steps:** 4
- **Total effective batch size:** 128
- **Optimizer:** AdamW (betas=(0.9, 0.999), eps=1e-08)
- **LR scheduler:** Cosine with minimum LR, warmup ratio 0.1
- **Epochs:** 1
- **Seed:** 42
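
A minimal sketch of how these hyperparameters might map onto TRL's `DPOConfig` / `DPOTrainer`. The `beta` value, the minimum-LR floor, and the output path are assumptions (they are not listed above), and the actual training script may differ.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the Stage 2 SFT checkpoint and the Stage 3 preference dataset.
model = AutoModelForCausalLM.from_pretrained("Shekswess/trlm-stage-2-sft-final-2")
tokenizer = AutoTokenizer.from_pretrained("Shekswess/trlm-stage-2-sft-final-2")
train_dataset = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")

args = DPOConfig(
    output_dir="trlm-stage-3-dpo",             # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,             # 32 x 4 on one device -> effective batch of 128
    num_train_epochs=1,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # assumed floor; the card only says "cosine with minimum LR"
    warmup_ratio=0.1,
    seed=42,
    beta=0.1,                                  # assumed DPO beta; not reported in this card
)

trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```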

### Framework Versions

- **Transformers:** 4.56.2
- **PyTorch:** 2.7.1+rocm7.0.0.git698b58a9
- **Datasets:** 4.0.0
- **Tokenizers:** 0.22.1

πŸš€ Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Shekswess/trlm-stage-3-dpo-final-2"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example inference with preference-aligned reasoning
messages = [
    {"role": "user", "content": "Explain why the sky is blue in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
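
If the model emits its reasoning inside `<think> ... </think>` tags, as the intended-use section suggests, the trace can be separated from the final answer. The snippet below continues from the code above; the exact tag format (and whether the tags survive `skip_special_tokens=True`) is an assumption.

```python
import re

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the reasoning trace from the final answer (assumes literal <think>...</think> tags in the text).
match = re.search(r"<think>(.*?)</think>(.*)", decoded, flags=re.DOTALL)
if match:
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    print("Reasoning trace:\n", reasoning)
    print("Final answer:\n", answer)
else:
    print(decoded)
```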

Part of the Tiny Reasoning Language Model (trlm) post-training pipeline.