Tiny Reasoning Language Model (trlm-135)

Table of Contents

  1. Model Summary
  2. Post-Training Pipeline
  3. How to use
  4. Training
  5. Evaluation
  6. Limitations
  7. Acknowledgements
  8. License

Model Summary

The Tiny Reasoning Language Model (trlm-135) is a 135M-parameter research prototype designed to study how small models can learn step-by-step reasoning. It was built on top of SmolLM2-135M-Instruct and fine-tuned through a 3-stage pipeline:

  1. Stage 1 – SFT on everyday conversations and instruction following (non-reasoning).
  2. Stage 2 – SFT on reasoning traces with <think> segments.
  3. Stage 3 – DPO alignment on preference pairs of reasoning traces.

The code for everything can be found here.


Post-Training Pipeline

(Diagram: the three-stage post-training pipeline – Stage 1 SFT on non-reasoning chat data, Stage 2 SFT on reasoning traces, Stage 3 DPO alignment.)

How to use

pip install -U transformers accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shekswess/trlm-135m"
device = "cuda"  # or "cpu"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)

# Example prompt
prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    {"role": "user", "content": prompt}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For reasoning-heavy tasks, set temperature=0.6 and top_p=0.95.
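
For example, a minimal sampling setup reusing the model and inputs from above (note that do_sample=True is required for temperature and top_p to take effect):

outputs = model.generate(
    **inputs,
    do_sample=True,       # enable sampling so temperature/top_p apply
    temperature=0.6,      # recommended for reasoning-heavy tasks
    top_p=0.95,           # nucleus sampling cutoff
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))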


Training

Model

  • Architecture: Decoder-only transformer (SmolLM2 backbone, which is in fact a Llama 3-based architecture).
  • Parameters: ~135M.
  • Precision: Mixed precision (bfloat16) during training.

Software & Hardware

  • Training Frameworks: PyTorch (ROCm), Hugging Face Transformers & TRL.
  • Hardware: AMD MI300X (192GB VRAM, 224GB RAM).

Special thanks to @HotAisle

Training Stages

  1. Stage 1 – SFT (non-reasoning)
    • ~58k samples, everyday conversations & instruction following.
  2. Stage 2 – SFT (reasoning)
    • ~78k samples with <think> segments.
  3. Stage 3 – DPO (alignment)
    • ~50k preference pairs (chosen vs. rejected reasoning traces).
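
To make the data formats concrete, here is a rough, hypothetical sketch of what one sample per stage might look like (field names follow common chat/preference conventions, not the exact training files):

# Hypothetical mini-samples illustrating each stage's data format.
# These are illustrative sketches, not records from the actual datasets.

stage1_sft = {  # Stage 1: plain instruction following, no reasoning trace
    "messages": [
        {"role": "user", "content": "Suggest a name for a coffee shop."},
        {"role": "assistant", "content": "How about 'The Daily Grind'?"},
    ]
}

stage2_sft = {  # Stage 2: the assistant turn carries a <think> segment
    "messages": [
        {"role": "user", "content": "Is 91 prime?"},
        {"role": "assistant", "content": "<think>91 = 7 * 13, so it has nontrivial divisors.</think> No, 91 = 7 * 13, so it is not prime."},
    ]
}

stage3_dpo = {  # Stage 3: chosen vs. rejected traces for one prompt
    "prompt": "Is 91 prime?",
    "chosen": "<think>91 = 7 * 13.</think> No, 91 is composite.",
    "rejected": "Yes, 91 is prime.",
}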

Evaluation

Evaluation was done with lm-eval-harness:

| Benchmark | Tiny Reasoning Language Model (trlm-135M) | SmolLM2-135M-Instruct | Improvement |
|---|---|---|---|
| ARC Challenge (avg) | 40.61 | 37.3 | +3.31 |
| BBH (3-shot) | 36.80 | 28.2 | +8.60 |
| BoolQ | 62.17 | – | N/A |
| GSM8K (5-shot) | 2.59 | 1.4 | +1.19 |
| IFEval (avg) | 35.49 | 29.9 | +5.59 |
| MMLU | 34.95 | 29.3 | +5.65 |
| PIQA | 64.91 | 66.3 | –1.39 |
| HellaSwag | – | 40.9 | N/A |
| MT-Bench | – | 19.8 | N/A |
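
As a reference point, a run of this kind can be launched with the lm-eval-harness CLI roughly as follows (the task list and flags here are illustrative assumptions, not the exact configuration behind the table):

pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=Shekswess/trlm-135m,dtype=bfloat16 \
    --tasks arc_challenge,boolq,piqa,mmlu \
    --batch_size 8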

Limitations

  • Not production-ready: hallucinations and logical errors are frequent.
  • Small size: limited general knowledge and reasoning depth.
  • English-only: multilingual capabilities not explored.

Acknowledgements

  • @HotAisle for providing the compute resources to train all three stages on an awesome AMD MI300X setup.
  • @mkurman88 for ideas, feedback and code samples.
  • The HuggingFaceTB team for the SmolLM2-135M-Instruct model and the Smoltalk2 dataset collection.
  • @scottgeng00 for the OLmO-3-Preference-Mix-Deltas dataset.
  • @eliebakouchi for help with the tokenization.

License

Apache 2.0

