Model Card for Qwen3-0.6B-MNLP-DPO

This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen3-0.6B-Base using the Mehdi-Zogh/MNLP_M2_dpo_dataset. The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.


Model Details

Model Description

This model was fine-tuned with the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The dataset used for preference learning consists of prompts paired with a preferred (chosen) and a rejected response, teaching the model to generate more helpful and appropriate answers in instructional contexts.
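
For reference, DPO trains the policy directly on these preference pairs with the standard DPO objective, where $\pi_\theta$ is the fine-tuned policy, $\pi_{\text{ref}}$ the frozen Qwen3-0.6B-Base reference, $(x, y_w, y_l)$ a prompt with its chosen and rejected responses, $\sigma$ the logistic function, and $\beta$ the preference-strength coefficient (the specific $\beta$ used for this model is not stated here):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$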


Uses

Direct Use

This model is intended to serve as an AI tutor specialized in EPFL course content.

Downstream Use

It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.

Out-of-Scope Use

  • Not recommended for use in high-stakes settings.
  • Not intended for use in languages other than English.
  • Not intended for generating factual or up-to-date information (base model was not trained for retrieval-based tasks).

Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "explain gradient descent in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Training Details

Training Data

The training data is the Mehdi-Zogh/MNLP_M2_dpo_dataset, which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
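
For illustration, a single record typically has the shape below. The field names follow the common prompt/chosen/rejected convention for DPO datasets and are an assumption rather than a guaranteed description of this dataset's schema; the texts are placeholders.

# Illustrative preference record (field names and contents are assumptions)
example = {
    "prompt": "Explain the bias-variance trade-off.",
    "chosen": "A clear, well-structured explanation preferred by annotators ...",
    "rejected": "A less helpful or less accurate completion ...",
}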

Training Procedure

The model was fine-tuned using trl's DPOTrainer with the hyperparameters listed below; a minimal training sketch follows the table.

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Epochs | 3 |
| Per-device train batch size | 1 |
| Per-device eval batch size | 1 |
| Gradient accumulation steps | 4 |
| Precision | bf16 |
| Early stopping patience | 3 |
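
The following is a minimal sketch of how such a setup can be reproduced with trl, assuming a recent trl release (DPOConfig and processing_class; older versions use tokenizer=) and a dataset with prompt/chosen/rejected columns. Split names, output paths, and the evaluation schedule are illustrative assumptions, not the authors' exact script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

dataset = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset")

config = DPOConfig(
    output_dir="qwen3-0.6b-mnlp-dpo",       # illustrative output path
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="epoch",                  # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,            # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = DPOTrainer(
    model=model,                            # ref_model defaults to a frozen copy of the base
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],     # hypothetical split name (see Evaluation below)
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()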

Evaluation

A held-out set of 320 samples from the dataset was used for validation.
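
The exact split procedure is not stated; one way to reproduce a 320-sample held-out split with the datasets library (split name and seed are assumptions) would be:

from datasets import load_dataset

ds = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset", split="train")  # assumed split name
splits = ds.train_test_split(test_size=320, seed=42)                # seed is illustrative
train_ds, val_ds = splits["train"], splits["test"]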

Testing Data, Factors & Metrics

Testing Data

The model was tested on zechen-nlp/MNLP_dpo_demo.

Metrics

  • Preference accuracy: the fraction of held-out pairs for which the model ranks the preferred (chosen) response above the rejected one. This is a standard metric in DPO training for evaluating how well the model aligns with human preferences; a hedged computation sketch follows this list.
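
As an illustration only, and not the authors' evaluation code: preference accuracy can be approximated by scoring each chosen/rejected pair with the policy's summed token log-probabilities and counting how often the chosen response wins. A full DPO reward accuracy would also involve the reference model and the β coefficient, which are omitted here for brevity; the helper names below are hypothetical.

import torch

def sequence_logprob(model, tokenizer, prompt, response):
    """Summed log-probability the model assigns to `response` given `prompt`.

    Assumes tokenizing prompt + response keeps the prompt tokens as a prefix,
    which is an approximation for some tokenizers.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of each next token given the preceding context
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the log-probs of the response tokens
    return token_logps[:, prompt_ids.shape[1] - 1:].sum().item()

def preference_accuracy(model, tokenizer, pairs):
    """pairs: iterable of (prompt, chosen, rejected) strings."""
    wins = sum(
        sequence_logprob(model, tokenizer, p, chosen)
        > sequence_logprob(model, tokenizer, p, rejected)
        for p, chosen, rejected in pairs
    )
    return wins / len(pairs)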

Results

  • The model achieved a preference accuracy of 84% ± 5.2% on the test set.
  • This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.