---
library_name: transformers
license: apache-2.0
base_model: Shekswess/trlm-stage-2-sft-final-2
tags:
  - trl
  - dpo
  - preference-alignment
  - reasoning
  - generated_from_trainer
model-index:
  - name: trlm-stage-3-dpo-final-2
    results: []
---

*TRLm Stage 3 Banner*

# 🧠 trlm-stage-3-dpo-final-2

**trlm-stage-3-dpo-final-2** is the Stage 3 post-training model for the Tiny Reasoning Language Model (trlm) project.
This stage focuses on preference alignment using Direct Preference Optimization (DPO) with 50k preference pairs.


πŸ“– Model Description

- **Base Model:** Shekswess/trlm-stage-2-sft-final-2
- **Type:** Causal Language Model (decoder-only transformer)
- **Stage:** Post-training Stage 3 (DPO)
- **Objective:** Align model outputs with human-preferred reasoning and answers by contrasting chosen vs. rejected completions.

This stage improves the model’s alignment, coherence, and reasoning stability.
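
For reference, DPO optimizes the standard objective below (Rafailov et al., 2023), pushing the policy to prefer chosen over rejected completions relative to a reference policy; here the reference would be the Stage 2 SFT checkpoint.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where `y_w` is the chosen and `y_l` the rejected completion for prompt `x`.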


## 🎯 Intended Uses & Limitations

### Intended Uses

- Aligned reasoning assistant with structured `<think>` traces
- Multi-turn reasoning with preference-optimized outputs
- Safer, more useful responses for reasoning tasks

### Limitations

- Trained only on preference data, so it may inherit biases from the source datasets
- Limited parameter count (135M) restricts knowledge breadth
- Still prone to hallucinations in complex reasoning chains

πŸ“Š Training Data

This model was trained on the dataset:
πŸ‘‰ Shekswess/trlm-dpo-stage-3-final-2

Dataset summary:

- **Entries:** 50,000 preference pairs
- **Source:** `scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered`
- **Focus:** Preference alignment with chosen vs. rejected responses

| Source Dataset | Split | Entries | % |
|---|---|---|---|
| `scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered` | train | 50,000 | 100% |
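
Each entry pairs a prompt with a chosen and a rejected completion. The record below is only an illustrative sketch; the field names (`prompt`, `chosen`, `rejected`) follow the common TRL DPO convention and are not taken from the dataset card.

```python
# Illustrative shape of a single preference pair
# (field names assumed per the usual TRL DPO convention, not verified against the dataset).
preference_pair = {
    "prompt": "Explain why the sky is blue in simple terms.",
    "chosen": "<think>Sunlight scatters off air molecules; blue light scatters most ...</think> The sky looks blue because ...",
    "rejected": "The sky is blue because it reflects the ocean.",
}
```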

βš™οΈ Training Procedure

### Training Hyperparameters

- **Learning rate:** 1e-5
- **Train batch size:** 32
- **Eval batch size:** 8
- **Gradient accumulation steps:** 4
- **Total effective batch size:** 128
- **Optimizer:** AdamW (betas=(0.9, 0.999), eps=1e-08)
- **LR scheduler:** Cosine with minimum LR, warmup ratio 0.1
- **Epochs:** 1
- **Seed:** 42
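
A minimal sketch of how these hyperparameters might map onto TRL's `DPOConfig` / `DPOTrainer`. The `beta` value, the minimum-LR floor, and the output path are assumptions (they are not listed above), and the actual training script may differ.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the Stage 2 SFT checkpoint and the Stage 3 preference dataset.
model = AutoModelForCausalLM.from_pretrained("Shekswess/trlm-stage-2-sft-final-2")
tokenizer = AutoTokenizer.from_pretrained("Shekswess/trlm-stage-2-sft-final-2")
train_dataset = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")

args = DPOConfig(
    output_dir="trlm-stage-3-dpo",             # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,             # 32 x 4 on one device -> effective batch of 128
    num_train_epochs=1,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # assumed floor; the card only says "cosine with minimum LR"
    warmup_ratio=0.1,
    seed=42,
    beta=0.1,                                  # assumed DPO beta; not reported in this card
)

trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```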

### Framework Versions

- **Transformers:** 4.56.2
- **PyTorch:** 2.7.1+rocm7.0.0.git698b58a9
- **Datasets:** 4.0.0
- **Tokenizers:** 0.22.1

πŸš€ Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Shekswess/trlm-stage-3-dpo-final-2"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example inference with preference-aligned reasoning
messages = [
    {"role": "user", "content": "Explain why the sky is blue in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
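
If the model emits its reasoning inside `<think> ... </think>` tags, as the intended-use section suggests, the trace can be separated from the final answer. The snippet below continues from the code above; the exact tag format (and whether the tags survive `skip_special_tokens=True`) is an assumption.

```python
import re

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the reasoning trace from the final answer (assumes literal <think>...</think> tags in the text).
match = re.search(r"<think>(.*?)</think>(.*)", decoded, flags=re.DOTALL)
if match:
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    print("Reasoning trace:\n", reasoning)
    print("Final answer:\n", answer)
else:
    print(decoded)
```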

Part of the Tiny Reasoning Language Model (trlm) post-training pipeline.