---
license: apache-2.0
tags:
  - dpo
  - unsloth
  - trl
  - qwen
  - instruction-tuning
  - preference-modeling
  - mnlp
datasets:
  - Tandogan/sft_dataset_final_train
  - Tandogan/MNLP_M2_dpo_dataset
base_model: Qwen/Qwen3-0.6B-Base
inference: false
---

# MNLP M2 DPO Model — Qwen3-0.6B Fine-Tuned with Direct Preference Optimization

This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of `Qwen/Qwen3-0.6B-Base`, developed as part of the MNLP M2 project. The model is further trained on the `Tandogan/MNLP_M2_dpo_dataset` preference dataset to better align its responses with human preferences.
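Rather than training a separate reward model, DPO optimizes the policy directly on pairs of chosen and rejected responses. For reference, the standard DPO objective (Rafailov et al., 2023) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected completions, $\pi_{\mathrm{ref}}$ is a frozen copy of the starting checkpoint, and $\beta$ is the DPO temperature listed under the hyperparameters below (0.1).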

## Model Description

### Training Procedure

#### Supervised Fine-Tuning (SFT)

- Dataset: `Tandogan/sft_dataset_final_train` (Alpaca-style prompt–completion pairs)
- Max sequence length: 2048
- Epochs: 4
- Optimizer: AdamW (learning rate = 3e-5, weight decay = 0)
- Precision: bf16
- Batch size: 2 (gradient accumulation = 4)
- Scheduler: Linear with 1% warmup
- Eval & checkpointing: every epoch
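The hyperparameters above correspond roughly to the TRL `SFTTrainer` sketch below. This is an illustration, not the exact training script: the dataset is assumed to expose a pre-rendered `text` column and a `validation` split, and some argument names differ across `trl`/`transformers` versions.

```python
# Illustrative sketch of the SFT stage with TRL (not the exact training script).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base = "Qwen/Qwen3-0.6B-Base"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Assumption: the dataset has a "text" column; otherwise pass a formatting_func.
dataset = load_dataset("Tandogan/sft_dataset_final_train")

args = SFTConfig(
    output_dir="qwen3-0.6b-sft",
    num_train_epochs=4,
    learning_rate=3e-5,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,            # 1% warmup
    max_seq_length=2048,          # argument name may differ in newer trl releases
    eval_strategy="epoch",        # `evaluation_strategy` on older transformers
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name is an assumption
    processing_class=tokenizer,          # `tokenizer=` on older trl
)
trainer.train()
```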

#### Direct Preference Optimization (DPO)

Two DPO fine-tuning runs were performed, differing only in the starting checkpoint:

1. From the base model (`Qwen/Qwen3-0.6B-Base`)
2. From the SFT model (`Tandogan/MNLP_M2_SFT`)

- Dataset: `Tandogan/MNLP_M2_dpo_dataset`
- Max sequence length: 2048 (prompt and completions each truncated to 1024 tokens)
- Epochs: 4
- Optimizer: AdamW (learning rate = 2e-6, weight decay = 0)
- Precision: bf16
- Batch size: 2 (gradient accumulation = 4)
- Scheduler: Cosine with 1% warmup
- DPO beta: 0.1
- Eval & checkpointing: every epoch
- Monitoring: Weights & Biases (WandB)
- Best epoch selection: based on validation loss
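For reference, this configuration maps roughly onto the TRL `DPOTrainer` sketch below. It is an illustration under assumptions, not the exact training script: the validation split name is a guess, and some argument names differ across `trl`/`transformers` versions.

```python
# Illustrative sketch of the DPO stage with TRL (not the exact training script).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Starting checkpoint: Qwen/Qwen3-0.6B-Base for run 1, Tandogan/MNLP_M2_SFT for run 2.
start = "Tandogan/MNLP_M2_SFT"
model = AutoModelForCausalLM.from_pretrained(start)
tokenizer = AutoTokenizer.from_pretrained(start)

# Expected in the standard "prompt" / "chosen" / "rejected" preference format.
dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset")

args = DPOConfig(
    output_dir="qwen3-0.6b-dpo",
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,            # 1% warmup
    beta=0.1,                     # DPO temperature
    max_length=2048,              # prompt + completion cap
    max_prompt_length=1024,       # prompt truncated to 1024 tokens
    eval_strategy="epoch",        # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    report_to="wandb",            # requires wandb to be installed and logged in
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,               # TRL builds a frozen reference copy when None
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name is an assumption
    processing_class=tokenizer,          # `tokenizer=` on older trl
)
trainer.train()
```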

## Intended Use

This model is intended for research and experimentation with preference-based alignment and reward modeling. It is not production-ready and may produce hallucinated, biased, or unsafe outputs; evaluate it carefully before using it for downstream tasks.

## How to Use

You can use the model with the `transformers` and `trl` libraries for inference or evaluation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the DPO fine-tuned model and its tokenizer (requires a CUDA GPU as written)
model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

# Greedy generation of up to 256 new tokens
prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
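As an optional variant, you can load the checkpoint in bf16 (the training precision) and use sampled decoding. The settings below are just one reasonable configuration, not something prescribed by this card, and `device_map="auto"` requires the `accelerate` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# bf16 roughly halves memory relative to fp32; accelerate places the weights automatically
model = AutoModelForCausalLM.from_pretrained(
    "Tandogan/MNLP_M2_dpo_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,    # sampled decoding; omit for greedy generation
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```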